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Abstract. The problem of small area estimation (SAE) is how to pro- 
duce reliable estimates of characteristics of interest such as means, 
counts, quantiles, etc., for areas or domains for which only small sam- 
ples or no samples are available, and how to assess their precision. The 
purpose of this paper is to review and discuss some of the new impor- 
tant developments in small area estimation methods. Rao [Small Area 
Estimation (2003)] wrote a very comprehensive book, which covers all 
the main developments in this topic until that time. A few review pa- 
pers have been written after 2003, but they are limited in scope. Hence, 
the focus of this review is on new developments in the last 7-8 years, 
but to make the review more self-contained, I also mention shortly 
some of the older developments. The review covers both design-based 
and model-dependent methods, with the latter methods further classi- 
fied into frequentist and Bayesian methods. The style of the paper is 
similar to the style of my previous review on SAE published in 2002, 
explaining the new problems investigated and describing the proposed 
solutions, but without dwelling on theoretical details, which can be 
found in the original articles. I hope that this paper will be useful both 
to researchers who like to learn more on the research carried out in 
SAE and to practitioners who might be interested in the application of 
the new methods. 
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1. PREFACE 

The problem of small area estimation (SAE) is 
how to produce reliable estimates of characteristics 
of interest such as means, counts, quantiles, et cetera, 
for areas or domains for which only small samples 
or no samples are available. Although the point esti- 
mators are usually of first priority, a related problem 
is how to assess the estimation (prediction) error. 

The great importance of SAE stems from the fact 
that many new programs, such as fund allocation for 
needed areas, new educational or health programs 
and environmental planning rely heavily on these es- 
timates. SAE techniques are also used in many coun- 
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tries to test and adjust the counts obtained from 
censuses that use administrative records. 

In 2002 I published a review paper with a similar 
title (Pfeffermann (2002)). In that year small area 
estimation (SAE) was flourishing both in research 
and applications, but my own feeling then was that 
the topic has been more or less exhausted in terms 
of research and that it will just turn into a routine 
application in sample survey practice. As the past 9 
years show, I was completely wrong; not only is the 
research in this area accelerating, but it now involves 
some of the best known statisticians, who otherwise 
are not involved in survey sampling theory or appli- 
cations. The diversity of new problems investigated 
is overwhelming, and the solutions proposed are not 
only elegant and innovative, but also very practical. 

Rao (2003) published a comprehensive book on 
SAE that covers all the main developments in this 
topic up to that time. The book was written about 
ten years after the review paper of Ghosh and Rao 
(1994), published in Statistical Science, which stim- 
ulated much of the early research in SAE. Since 
2003, a few other review papers have been pub- 
lished; see, for example, Rao (2005, 2008), Jiang and 
Lahiri (2006a, 2006b), Datta (2009) and Lehtonen 
and Veiganen (2009). Notwithstanding, SAE is re- 
searched and applied so broadly that I decided that 
the time is ripe for a new comprehensive review that 
focuses on the main developments in the last 7- 
8 years that I am aware of, and which are hardly 
covered in the review papers mentioned above. The 
style of the paper is similar to the style of my pre- 
vious review, explaining the problems investigated 
and describing the proposed solutions, but without 
dwelling on theoretical details, which can be found 
in the original articles. For further clarity and to 
make the paper more self-contained, I start with a 
short background and overview some of the "older" 
developments. I hope that this paper will be useful 
to researchers who wish to learn about the research 
carried out in SAE and to practitioners who might 
be interested in applying the new methods. 

2. SOME BACKGROUND 

The term "SAE" is somewhat confusing, since it 
is the size of the sample in the area that causes 
estimation problems, and not the size of the area. 
Also, the "areas" are not necessarily geographical 
districts and may define another grouping, such as 
socio-demographic groups or types of industry, in 



which case they are often referred to as domains. 
Closely related concepts in common use are "poverty 
mapping" or "disease mapping," which amount to 
SAE of poverty measures or disease incidence and 
then presenting the results on a map, with different 
colors defining different levels (categories) of the es- 
timators. What is common to most small area esti- 
mation problems is that point estimators and error 
measures are required for every area separately, and 
not just as an average over all the areas under con- 
sideration. 

SAE methods can be divided broadly into "design- 
based" and "model-based" methods. The latter meth- 
ods use either the frequentist approach or the full 
Bayesian methodology, and in some cases combine 
the two, known in the SAE literature as "empirical 
Bayes." Design-based methods often use a model 
for the construction of the estimators (known as 
"model assisted"), but the bias, variance and other 
properties of the estimators are evaluated under the 
randomization (design-based) distribution. The ran- 
domization distribution of an estimator is the dis- 
tribution over all possible samples that could be se- 
lected from the target population of interest under 
the sampling design used to select the sample, with 
the population measurements considered as fixed 
values (parameters). Model-based methods on the 
other hand usually condition on the selected sample, 
and the inference is with respect to the underlying 
model. 

A common feature to design- and model-based 
SAE is the use of auxiliary covariate information, as 
obtained from large surveys and/or administrative 
records such as censuses and registers. Some esti- 
mators only require knowledge of the covariates for 
the sampled units and the true area means of these 
covariates. Other estimators require knowledge of 
the covariates for every unit in the population. The 
use of auxiliary information for SAE is vital because 
with the small sample sizes often encountered in 
practice, even the most elaborated model can be 
of little help if it does not involve a set of covari- 
ates with good predictive power for the small area 
quantities of interest. 

3. NOTATION 

Consider a population U of size N, divided into 
M exclusive and exhaustive areas U\ U • • • U Um with 
Ni units in area i, YliLx = N. Suppose that sam- 
ples are available for m < M of the areas, and let 
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s = si U • • • U s m define the overall sample, where Si 
of size rii is the sample observed for sampled area 
i, Y^i n i =n - Note that rii is random unless a 
planned sample of fixed size is taken in that area. 
Let y define the characteristic of interest, and de- 
note by yij the response value for unit j belonging 
to area i, i = 1, . . . , M , j = 1, . . . , Ni with sample 
means yi = YTj=\Vnl n i^ where we assume without 
loss of generality that the sample consists of the 
first rii units. We denote by Xjj = (xuj, . . . , x p ij)' 
the covariate values associated with unit and 
by Xj = Yyj=i x ij/ n i the column vector of sample 
means. The corresponding vector of true area means 
is X{ = ^2f=iX-ij/Ni. The area target quantity is de- 
noted by Of, for example, 6{ = Yi = Ylj=i Vij/Ni, the 
response area mean. Estimating a proportion is a 
special case where is binary. In other applica- 
tions Oi may represent a count or a quantile. 

4. DESIGN-BASED METHODS 

4.1 Design-Based Estimators in Common Use 

A recent comprehensive review of design-based 
methods in SAE is provided by Lehtonen and Veija- 
nen (2009). Here I only overview some of the basic 
ideas. Suppose that the sample is selected by simple 
random sampling without replacement (SRSWOR) 
and that the target quantities of interest are the 
means Y{. Estimation of a mean contains as special 
cases the estimation of a proportion and the estima- 
tion of the area distribution Fi(t) = ^2j eUi Vij/Ni, in 
which 

= HlJij — where 1(A) is the indi- 
cator function. Estimators of the percentiles of the 
area distribution are commonly obtained from the 
estimated distribution. 

If no covariates are available the direct design- 
unbiased estimator of the area mean and its condi- 
tional design variance over the randomization dis- 
tribution for given rii are given by 



II, 



(4.1) 
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V D [y l \n l ] = (Sf/n i )[l-(n i /N i )], 

where Sf = Ef=i(Vij_ ~ YifK^i - 1). The term "di- 
rect" is used to signify an estimator that only uses 
the data available for the target area at the specific 
time of interest. The variance Vo[j/i|^i] is 0(1 /rii), 
and for small rii it is usually large, unless Sf is suf- 
ficiently small. 

Next suppose that covariates Xy are also observed 
with xuj = 1 . An estimator in common use that uti- 



lizes the covariate information is the synthetic esti- 
mator, 



1 



Ni 



i=i 



where B 



the ordinary least square (OLS) estimator. Notice 
that under SRSWOR, B is approximately design- 
unbiased and consistent for the vector B of regres- 
sion coefficients computed from all the population 
values, irrespective of whether a linear relationship 
between y and x exists in the population. The design- 
unbiasedness and consistency are with respect to 
the randomization distribution, letting N and n in- 
crease to infinity in a proper way. An estimator is 
approximately design-unbiased if the randomization 
bias tends to zero as the sample size increases. The 
term "synthetic" refers to the fact that an (approx- 
imately) design-unbiased estimator computed from 
all the areas (B in the present case) is used for every 
area separately, assuming that the areas are "homo- 
geneous" with respect to the quantity being esti- 
mated. Thus, synthetic estimators borrow informa- 
tion from other "similar areas" and they are there- 
fore indirect estimators. 

The obvious advantage of the synthetic estima- 
tor over the simple sample mean or other direct es- 

^ dir 

timators such as the regression estimator Y r 
Vi + (Xi 



reg,« 

Xi)' Bi, where Bi is computed only from 

- s y n 

the data observed for area i, is that Yai£>(Y res i ) = 
0(l/n), and n = YlT=i n i ls usually large. The use of 
the synthetic estimator is motivated ("assisted") by 
a linear regression model of y on x in the population 
with a common vector of coefficients. However, for 

~ syn _ _ 

x Uj = 1, E D (Y iegii - Y-) =i -X[(Bi - B), where B, 
is the OLS computed from all the population values 
in area i. Thus, if in fact different regression coef- 
ficients Bi operate in different areas, the synthetic 
estimator may have a large bias. When the sam- 
ple is selected with unequal probabilities, the OLS 
estimator B in (4.2) is commonly replaced by the 
probability weighted (PW) estimator 



B 



pw 



m rii 

/I / I w ij x ij x ij 
i=l j=l 



rii 

w ij x ijyij 

=1 



=1 r- 



where {wij = 1/Pr[(i, j) G s]} are the base sampling 
weights. 

In order to deal with the possible large bias of the 
synthetic estimator, it is common to estimate the 
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bias and then subtract it from the synthetic estima- 
tor. The resulting survey regression estimator takes 
the form 



A S— R 1 



pwj 



(4.3) 



3=1 



PW! 



where (Y"i h—Tj A^,h-t) are the Horvitz-Thompson 
(H-T) estimators of (Y^Xj). The estimator (4.3) 
is approximately design-unbiased and performs well 
when the covariates have good predictive power, but 
the variance is back to order 0(l/nj). The vari- 
ance is often reduced by multiplying the bias correc- 
tion V" | '/•,,!.'/,, -x'^Bp^/Ni by Ni/Y^Li^ij = 
Ni/Ni. 

A compromise between the possibly large bias of 
the synthetic estimator and the possibly large vari- 
ance of the survey regression estimator is achieved 
by taking a linear combination of the two. The re- 
sulting combined (composite) estimator is defined as 



(4-4) Y { 



. COM 



± S-R * syn 

SiYi + (1 - 5i)Y : 



0<6i< 1. 



Ideally, the coefficient <5j should be chosen to min- 

~ COM 

imize the mean square error (MSE) of Y ^ , but 
assessing sufficiently accurately the bias of the syn- 
thetic estimator for a given area is usually impos- 
sible. Hence, it is common to let Si depend on the 
sample size in the area, such that the larger rij, 
the larger is Si. See Rao (2003) for review of other 
combined estimators, and methods of specifying Si. 

4.2 Some New Developments in Design-Based 
Small Area Estimation 

A general class of estimators is obtained by cali- 
brating the base sampling weights Wij. Suppose that 
the population can be partitioned into C calibra- 



tion groups U 



[7 (1) U 



U t^(c) with known totals 



that each area Ui belongs to one of the groups. Let 
s = S(i) U • • • U S(c) define the respective partitioning 
of the sample. In a special case C = 1 and [7m = U. 
The calibrated estimator of the mean Yi is computed 

as 



(4.5) 



~ cal 



rii 



E 

M'es {c ) 



w ij-*-ij — ^x(c) • 



The calibration weights {wij} are chosen so that 
they minimize an appropriate distance from the base 
weights {w^}, subject to satisfying the constraints 



t x r c ) . For example, when using the 



distance X 2 = Eijea (c) Hj 
the calibrated weights are 



Wi 



l /wij and xuj = 1, 



w 



ij — WijQij] 



(4.6) 9ij = il + (t x(c) - t 



c x(c),H-Tj 



E 



UJij~X.ij~Xij 



where i x (c),H— T is the H-T estimator of the total 
^x(c)- When U c = Ui (the calibration group is the 

~ cal 

domain), Y i is the familiar generalized regression 
(GREG) estimator in the domain. 

Calibration of the sampling weights is in broad 
use in sample survey practice, not only for SAE. 
See Kott (2009) for a recent comprehensive review 
and discussion. The rationale of the use of calibrated 
estimators in SAE is that if y is approximately a 

\c) 



linear combination of x in U( c \, then Yi = X'-B, 



t 



x(c)i 



for domains i E U c , and since Ej 

Y% = J2jLi w ijUij/Ni is expected to be a good es- 
timator of Yi. Indeed, the advantage of estimator 
(4.5) over (4.2) is that it is assisted by a model 
that only assumes common regression coefficients 
within the groups Ur c \, and not for all the domains, 
as implicitly assumed by estimator (4.2). Estimator 
(4.5) is approximately design-unbiased irrespective 

~ cal 

of any model, but "Var£)(Y' i |n,) = 0(l/nj), which 
may still be large. 

Another way of calibrating the weights is by use of 
instrumental variables (Estevao and Sarndal, 2004, 
2006). Denote the vector of instrument values for 
unit (i,j) by hij. The calibrated weights are defined 
as 



W: 



Wi 



t x ( c ) of the auxiliary variables in the groups, such 

(4.7) 



(*x(c) 



f x(c),H-T 



)' 



i] hij^ij 



Note that the instrument values need only be 
known for the sampled units in sr c ) and that 



t cx , thus satisfying the same con- 



straints as before. The calibrated estimator of Yi is 
now Y?L. = EZi Wij* ■ Vij/ N i- When h = x, w ins - 



W i3- 



The use of instruments replaces the search for 



an appropriate distance function by imposing a struc- 
ture on the calibration weights, and it allows one, in 
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principle, to find the best instruments in terms of 
minimizing an approximation to the variance of the 
calibrated estimator. However, as noted by Estevao 
and Sarndal (2006), the resulting optimal weights 
depend on unknown population quantities which, 
when estimated from the sample, may yield unstable 
estimators. See Kott (2009) for further discussion. 

The synthetic estimator (4.2), the survey regres- 
sion estimator (4.3) and the various calibrated es- 
timators considered above are all assisted by mod- 
els that assume a linear relationship between y and 
x. These estimators only require knowledge of the 
covariates for the sampled units, and the area (or 
group) totals of these covariates. Lehtonen, Sarndal 
and Veijanen (2003, 2005) consider the use of gener- 
alized linear models (GLM), or even generalized lin- 
ear mixed models (GLMM) as the assisting models, 
which require knowledge of the covariates for every 
element in the population. Suppose that EM{Vij) = 
f(xij]ip) for some nonlinear function /(•) with an 
unknown vector parameter ip, where Em(-) defines 
the expectation under the model. A simple impor- 
tant example is where f(pqj]ip) is the logistic func- 
tion. Estimating ip by the pseudo-likelihood (PL) ap- 
proach yields the estimator ip p y and predicted values 

{Vij — f( x ij'i V'pi)}- The PL approach consists of es- 
timating the likelihood equations that would be ob- 
tained in case of a census by the corresponding H-T 
estimators (or weighting each score function by its 
sampling weight), and then solving the resulting es- 
timated equations. The synthetic and "generalized 
GREG" estimators are computed as 

^syn 1 Ni 

^GLM,i = JT X/ /( X *J^pl); 
% J'=l 

- GREG * syn 
(4.8) ^GLM,i — ^GLM,i 

+ jf. w v bus - f( x v ; • 

1 3=1 

A further extension is to include random area ef- 
fects in the assisting model, assuming EM(yij\xij, 
Ui) = f(xij,Ui; ip*), E M (ui)=0, Var M («i) = o-«- Es- 
timation of the fixed parameters ip* , a\ and the ran- 
dom effects Ui is now under the model, ignoring the 
sampling weights. The extended synthetic and gen- 
eralized GREG estimators are defined similarly to 
(4.8), but with /(xjjj^pi) replaced by f(x.ij,Ui;ip*). 
For sufficiently large sample size n,, the extended 
generalized GREG is approximately design-unbiased 
for the true area mean, but it is not clear how to es- 



timate the design (randomization) variance in this 
case in a way that accounts for the prediction of 
the random effects. Torabi and Rao (2008) compare 
the MSE of model-based predictors and a GREG 
assisted by a linear mixed model (LMM). 

Jiang and Lahiri (2006a) propose the use of model- 
dependent estimators that are design-consistent un- 
der the randomization distribution as the area sam- 
ple sizes increase. The basic idea is to model the 

direct estimators Y ' iw = Y^,=\ w iiViilY^,=\ w ii m_ 
stead of the individual observations yij, and then 
employ the empirical best predictor of the area mean 
under the model. The authors consider the general 

two-level model E M [Yiw\ui] = & = £,{ui,X iV) ;ip), 
where the UiS are independent random area effects 
with zero mean and variance a^, Xi w = Y^jLi Wij^j/ 
Sj=i w iji an d £(•) is some known function with un- 
known parameters ip. The empirical best predictor 
is the best predictor under the model (minimum ex- 
pected quadratic loss), but with the parameters ip 

-EBP 

replaced by model consistent estimators; Y i = 

Em{€i\Y iw,Xi w ;ip). The estimator is shown to be 
model-consistent under correct model specification 
and design-consistent for large rii, even if the model 
is misspecified, thus robustifying the estimation. The 
authors develop estimators of the prediction mean 
squared error (PMSE) for bounded sample sizes rn, 
with bias of desired order o(l/m), where m is the 
number of sampled areas. The PMSE is computed 
with respect to the model holding for the individ- 
ual observations and over the randomization distri- 
bution. The use of design consistent estimators in 
SAE is somewhat questionable because of the small 
sample sizes in some or all of the areas, but it is 
nonetheless a desirable property. This is so because 
it is often the case that in some of the areas the 
samples are large, and it is essential that an esti- 
mator should work well at least in these areas, even 
if the model fails. Estimators with large randomiza- 
tion bias even for large samples do not appeal to 
practitioners. 

Chandra and Chambers (2009) propose the use of 
model-based direct estimators (MBDE). The idea is 
to fit a model for the population values, compute the 
weights defining the Empirical Best Linear Unbiased 
Predictor (EBLUP) of the population total under 
the model and then use the weights associated with 
a given area to compute an almost direct estimator. 
The model fitted for the population values Yjj is the 
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general linear model, 

Y U = X U (3 + £ U ; 

(4.9) 



E( £u ) = 0, 



E(eue'u 



where s signifies the sample of size n, and r sig- 
nifies the sample-complement of size (N — n). As 
seen later, the models in common use for SAE de- 
fined by (5.1) and (5.3) below are special cases of 
(4.9). Let y s denote the column vector of sample 
outcomes. For known S, the BLUP of the popula- 
tion total t y = Ylk=i y k un der the model is 

iy LVP = l' n y s + ljv_ n [X r ^GLS 



(4.10) 



+ £ rs £- 1 (y s -X s /3 GLS )] 



E 



where l' k is a row vector of ones of length k, X s {X r ) 
is the design matrix corresponding to the sampled 
(nonsampled) units and /3qls is the generalized least 
square estimator. The EBLUP is i^ BLUP = 

E 



k£s 



EBLUP 



yk, where the EBLUP weights are the 
same as in (4.10), but with estimated parameters. 
The MBDE of the true mean in area i is 



(4.ii) rr D = E»f LUP w/E' 

jest jesi 

The authors derive estimators for the bias and vari- 
ance of the MBDE and illustrate its robustness to 

certain model misspecifications. Note, however, that 
- MBD 

Y i is a ratio estimator and therefore may have a 
nonnegligible bias in areas i with small sample size. 

All the estimators considered so far assume a given 
sampling design with random area sample sizes. 
When the target areas are known in advance, con- 
siderable gains in efficiency can be achieved by mod- 
ifying the sampling design and in particular, by con- 
trolling the sample sizes within these areas. In a 
recent article, Falrosi and Righi (2008) propose a 
general strategy for multivariate multi-domain esti- 
mation that guarantees that the sampling errors of 
the domain estimators are lower than pre-specified 
thresholds. The strategy combines the use of a bal- 
anced sampling technique and GREG estimation, 
but extensions to the use of synthetic estimators and 
model-based estimation are also considered. A suc- 
cessful application of this strategy requires good pre- 
dictions of weighted sums of residuals featuring in 
the variance expressions, and it may happen that 
the resulting overall sample size is far too large, but 



w 



EBLUP 



this is a promising approach that should be studied 
further. 

4.3 Pros and Cons of Design-Based Small Area 
Estimation 

The apparent advantage of design-based methods 
is that the estimation is less dependent on an as- 
sumed model, although models are used (assisted) 
for the construction of the estimators. The estima- 
tors are approximately unbiased and consistent un- 
der the randomization distribution for large sample 
sizes within the areas, which as discussed before is 
a desirable property that protects against possible 
model misspecification at least in large 

Against this advantage stand many disadvantages. 
Direct estimators generally have large variance due 
to small sample sizes. The survey regression estima- 
tor is approximately unbiased but may likewise be 
too variable. Synthetic estimators have small vari- 
ance but are generally biased. Composite estima- 
tors have smaller bias than synthetic estimators but 
larger variance, and it is not obvious how to best 
choose the weights attached to the synthetic esti- 
mator and the unbiased estimator. Computation of 
randomization-based confidence intervals generally 
requires large sample normality assumptions, but 
the sample sizes in at least some of the areas may 
be too small to justify asymptotic normality. 

Another limitation of design-based inference (not 
restricted to SAE) is that it does not lend itself to 
conditional inference, for example, conditioning on 
the sampled values of the covariates or the sampled 
clusters in a two-stage sampling design. This again 
inflates the variance of the estimators. Conditional 
inference is in the heart of classical statistical in- 
ference under both the frequentist and the Bayesian 
approaches. Last, but not least, an important limita- 
tion of design-based SAE is that there is no founded 
theory for estimation in areas with no samples. The 
use of the randomization distribution does not ex- 
tend to prediction problems, such as the prediction 
of small area means for areas with no samples. It is 
often the case that samples are available for only a 
minority of the areas, but estimators and MSE es- 
timators are required for each of the areas, whether 
sampled or not. 

5. MODEL-BASED METHODS 

5.1 General Formulation 

Model-based methods assume a model for the sam- 
ple data and use the optimal or approximately op- 
timal predictor of the area characteristic of interest 
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under the model. The MSE of the prediction error 
is likewise defined and estimated with respect to the 
model. Note that I now use the term "prediction" 
rather than estimation because the target character- 
istics are generally random under the model. The 
use of models overcomes the problems underlying 
the use of design-based methods, but it is impor- 
tant to emphasize again that even the most elab- 
orated model cannot produce sufficiently accurate 
predictors when the area sample size is too small, 
and no covariates with good predictive power are 
available. The use of models raises the question of 
the robustness of the inference to possible model 
misspecification, and Sections 6.3-6.6 review stud- 
ies that deal with this problem from different per- 
spectives. Section 8 considers model selection and 
diagnostic checking. 

Denote by 6% the target quantity in area i (mean, 
proportion, . . .). Let j/j define the observed responses 
for area i and x» define the corresponding values of 
the covariates (when available). As becomes evident 
below, yi is either a scalar, in which case x» is a vec- 
tor, or yi is a vector, in which case Xj is usually a 
matrix. A typical small area model consists of two 
parts: The first part models the distribution (or just 
the moments) of yi\0i',tpri)- The second part mod- 
els the distribution (moments) of 9i\xi]ip(2)i linking 
the (9jS to known covariates and to each other. This 
is achieved by including in the model random ef- 
fects that account for the variability of the 9iS not 
explained by the covariates. The hyper-parameters 
tp = (ip(i),ip(2)) are typically unknown and are es- 
timated either under the frequentist approach, or 
under the Bayesian approach by setting appropriate 
prior distributions. In some applications the index i 
may define time, in which case the model for 0j|xjj V>2 
is a time series model. 

5.2 Models in Common Use 

In this section, I review briefly three models in 
common use, as most of the recent developments in 
SAE relate to these models or extensions of them. 
For more details see Rao (2003), Jiang and Lahiri 
(2006a, 2006b), Datta (2009) and the references 
therein. I assume that the model holding for the 
sample data is the same as the model holding in 
the population, so that there is no sample selection 
bias. The case of informative selection of the areas 
to be sampled or informative sampling within the 
selected areas, whereby the sample selection or re- 
sponse probabilities are related to the response vari- 
able even after conditioning on the model covariates 



is considered in Section 7. Notice that in this case 
the sample model differs from the population model. 

5.2.1 Area level model This model is in broad use 
when the covariate information is only at the area 
level, so that Xj is a vector of known area character- 
istics. The model, studied originally for SAE by Fay 
and Herriot (1979) is defined as 

(5.1) yi = 8i + ei; 4 = x' i( 5 + m, 

where jji denotes the direct sample estimator of 9{ 
(e.g., the sample mean yi when the sample is se- 
lected by SRS), and represents the sampling er- 
ror, assumed to have zero mean and known design 
(randomization) variance, Var£>(ej) = cr^. The ran- 
dom effects Ui are assumed to be independent with 
zero mean and variance a\. For known cr^, the best 
linear unbiased predictor (BLUP) of 0j under this 
model is 

h = HVi + (1 - 7i) x ^GLS 

(5.2) = x^ GL s + r/i(Vi - x-/3 G ls) 
= x-/3 G ls + 

The BLUP 9i is in the form of a composite estimate 
[equation (4.4)], but with a tuning (shrinkage) co- 
efficient 7j = (j\l{o\ + cfjj), which is a function of 
the ratio of the variances of the prediction 

errors of x^/3 and y~i, respectively. The coefficient jt 
defines optimally the weights assigned to the syn- 
thetic estimator x^/3qls and y^, unlike the case of 
design-based estimators where the weight is assigned 
in a more ad hoc manner. See the discussion below 
(4.4). Note that the BLUP property does not re- 
quire specifying the distribution of the error terms 
beyond the first two moments, and 9i is also the lin- 
ear Bayes predictor in this case. Under normality of 
the error terms and a diffuse uniform prior for /3, 9i 
is the Bayesian predictor (posterior mean) of 9i. For 
a nonsampled area k, the BLUP is now obtained 
optimally as xJ^gls- 

In practice, the variance a\ is seldom known and is 
replaced in 73 and Pgls by a sample estimate, yield- 
ing what is known as the empirical BLUP (EBLUP) 
under the frequentist approach, or the empirical 
Bayes (EB) predictor when assuming normality. The 
latter predictor is the posterior mean of 9i, but with 
o\ replaced by a sample estimate obtained from the 
marginal distribution of the direct estimators given 
the variance. Alternatively, one may compute the 
Hierarchical Bayes (HB) predictor by assuming prior 
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distributions for f3 and a\ and computing the poste- 
rior distribution of Oi given the available data. The 
posterior distribution can be used for computation 
of the point predictor and a credibility (confidence) 
interval. 

Remark 1. The synthetic estimator x£/?glS) and 
hence the BLUP Oi are unbiased predictors under 
the joint distribution of and Oi in the sense that 
E{0i — Oi) = 0, but are biased when conditioning on 
Ui . The predictor Oi is biased also under the random- 
ization distribution. Conditioning on m amounts to 
assuming different fixed intercepts in different ar- 
eas and the unbiasedness of Oi under the model is 
achieved by viewing the intercepts as random. 

Remark 2. It is often the case that the link- 
ing model is defined for a transformation of Oi. For 
example, Fay and Herriot (1979) actually assume 
log(0j) = x^/3 + Ui in (5.1) and use the direct esti- 
mator yi = log(yj), and then predict Oi as exp(0j), 
where 0; is the BLUP (EBLUP) of log(0j) under 
the model. However, exp(#j) is not the BLUP of 
Oi = exp[log(#i)]. On the other hand, the EB and HB 
approaches produce optimal predictors of Oi , even if 
the linking model uses a transformation of Oi, with 
or without the use of a similar transformation for 
the direct estimator. In this respect, the latter two 
approaches are more flexible and with wider applica- 
bility, but at the expense of requiring further para- 
metric assumptions. 

5.2.2 Nested error unit level model This model 
uses individual observations y^ such that yi is now 
a vector, and Xj is generally a matrix. The use of 
this model for SAE requires that the area means 
Xj = ^2f=i*ij/Ni are known. The model, first pro- 
posed for SAE by Battese, Harter and Fuller (1988) 
has the form 

(5.3) y i j=y! ij P + Ui + e i j, 

where the iijS (random effects) and the EijS (residual 
terms) are mutually independent with zero means 
and variances o\ and <rf, respectively. Under the 
model, the true small area means are Yi = X[(3 + 
Ui + Ei, but since £j = Yljli e ij/ N -i — f° r large Aj, 
the target means are often defined as Oi = X^fi + Ui = 
E(Yi\ui). For known variances (a 2 , a 2 ), the BLUP of 
Oi is 

0i = 7i[yi + (Xi-x l )'f3 GLS ] 

(5-4) 

+ (l- 7 i)^/?GLS, 



where $gls is the GLS of /3 computed from all the 
observations, x» = YTjLi^ij / n i and ji = o-\j{o-\ + 
a"l/ Hi). For area k with no sample (but known Xk), 
the BLUP is k = XJ^gls- See Rao (2003) for the 
BLUP of the means Yi in sampled areas. 

The BLUP (5.4) is also the Bayesian predictor 
(posterior mean) under normality of the error terms 
and a diffuse uniform prior for (3. Replacing the vari- 
ances a\ and o 2 e in ji and /3qls by sample estimates 
yields the corresponding EBLUP or EB predictors. 
Hierarchical Bayes (HB) predictors are obtained by 
specifying prior distributions for j3 and the two vari- 
ances and computing the posterior distribution of Oi 
(or Yi) given all the sample observations in all the 
areas. Remark 1 applies to the BLUP (EBLUP) un- 
der this model as well. 

5.2.3 Mixed logistic model The previous two mod- 
els assume continuous responses. Suppose now that 
yij is binary, taking the values 1 or 0, in which 
case the small area quantities of interest are usually 
proportions or counts (say, the proportion or total 
of unemployed persons in the area). The following 
generalized linear mixed model (GLMM) considered 
originally by MacGibbon and Tomberlin (1989) for 
SAE is in broad use for this kind of problems: 

Pr(y,y = l\pij) =pij; 

(5.5) 

logit(p^) = x-^/3 + m; Ui ~ A(0, a u ). 

The responses y^ are assumed to be conditionally 
independent, given the random effects Ui, and like- 
wise for the random effects. The purpose is to pre- 
dict the true area proportions = ^2f=i Vij/Ni. Let 

i/j = ((3,CTu) denote the model parameters. For this 
model, there is no explicit expression for the best 
predictor (BP) under a quadratic loss function, that 
is, for pf p = E(pi\yi,Xi;tp), but as shown in Jiang 
and Lahiri (2006b), the BP can be computed (ap- 
proximated) numerically as the ratio of two one- 
dimensional integrals. Jiang and Lahiri review meth- 
ods of estimating ip, yielding the empirical BP (EBP) 
pf BP = E(pi\yi,x.i;if;), which is also the EB predic- 
tor under the same assumptions. Application of the 
full HB approach under this model consists of the 
following basic steps: 

(1) specify prior distributions for o~\ and j3; 

(2) generate observations from the posterior dis- 
tributions of f3, a\ and u\,...,u m by say, MCMC 
simulations, and draw a large number of realiza- 
tions 0( r \al {r \{4 r) }), r = l,...,R,i = l,...,m, 
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j i ,. , ■ (r) (r) eM< h P (r) +4 r) ) 
and hence realizations wV ~ pV = — ,,,, 

for k $l Si] 

(3) predict: ft = (£ i£s . %i + £fc£ Si Vik)/N%', Vik = 
Writing Vi = h Ef=i(E ieSi Htf+E**, J/J?)/** = 



"H E^LiPi > the posterior variance is approximated 

as V pos t{Pi) = ^(^i) J2r=i(Pi r) -Pi) 2 - 

Ghosh et al. (1998) discuss the use of HB SAE for 
GLMM, covering binary, count, multi-category and 
spatial data. In particular, sufficient conditions are 
developed for the joint posterior distribution of the 
parameters of interest to be proper. 

6. NEW DEVELOPMENTS IN MODEL-BASED 

SAE 

6.1 Estimation of Prediction MSE 

As stated in the introduction, an important aspect 
of SAE is the assessment of the accuracy of the pre- 
dictors. This problem is solved "automatically" un- 
der the Bayesian paradigm, which produces realiza- 
tions of the posterior distribution of the target quan- 
tities. However, estimation of the prediction MSE 
(PMSE) and the computation of confidence intervals 
(C.I.) under the frequentist approach is complicated 
because of the added variability induced by the esti- 
mation of the model hyper-parameters. Prasad and 
Rao (1990) developed PMSE estimators with bias of 
order o(l/m), (m is the number of sampled areas), 
under the linear mixed models (5.1) and (5.2) for the 
case where the random errors have a normal distri- 
bution, and the model variances are estimated by 
the ANOVA method of moments. Datta and Lahiri 
(2000) extended the estimation of Prasad and Rao 
to the more general mixed linear model, 



(6.1) m = Xi/3 + + a 



l,...,m, 



where Aj and Zi are fixed matrices of order rij x k 
and rii x d, respectively, and U{ and are indepen- 
dent normally distributed random effects and resid- 
ual terms of orders d x 1 and rij x 1, respectively, 
Ui ~ Nd(0,Qi), ei ~ N ni (0,Ri). The variance ma- 
trices are known functions of variance components 
C = (Ci) • • • ; Cl)- The authors develop PMSE estima- 
tors with bias of order o(l/m) for the EBLUP ob- 
tained when estimating f3 and £ by MLE or REML. 
Das, Jiang and Rao (2004) extend the model of 
Datta and Lahiri (2000) by relaxing the assumption 



of independence of the error terms between the ar- 
eas and likewise develop an estimator for the PMSE 
of the EBLUP when estimating the unknown model 
parameters by MLE or REML, with bias of order 
o(l/m). Datta, Rao and Smith (2005) show that for 
the area level model (5.1), if a 2 is estimated by the 
method proposed by Fay and Herriot (1979), it is re- 
quired to add an extra term to the PMSE estimator 
to achieve the desired order of bias of o(l/m). See 
Datta (2009) for an extensive review of methods of 
estimating the PMSE of the EBLUP and EB under 
linear mixed models (LMM). 

Estimation of the PMSE under the GLMM is more 
involved, and in what follows, I review resampling 
procedures that can be used in such cases. For conve- 
nience, I consider the mixed logistic model (5.5), but 
the procedures are applicable to other models be- 
longing to this class. The first procedure, proposed 
by Jiang, Lahiri and Wan (2002) uses the jackknife 
method. Let A; = E(pf BP - pi) 2 denote the PMSE, 
where pi = J2f=iVij/^i is the true proportion and 
pf BP = E(pi\yi,x.i;ip) is the EBP. The following de- 
composition holds: 



(6.2) 



A 



M u + M 2i , 



(EBP) ~(BPh2 
-Pi ) 



where Mu is the PMSE of the BP (assumes known 
parameter values) and Mu is the contribution to 
the PMSE from estimating the model parameters, 
tjj. Denote by Af p (?/>) the "naive" estimator of Mu, 
obtained by setting tp = Let Af p (?/>_;) denote the 
naive estimator when estimating tp from all the ar- 
eas except for area I, and pf BP (ip-i) denote the cor- 
responding EBP. The jackknife estimator of PMSE 
is 



Xf K = M li + M 2l ; 



(6.3) 



M li = Af p (V0 



m 



m 



1=1 



1 m 

i=i 

Under some regularity conditions, E(\j K ) — Aj = 
o(l/m), as desired. 

The jackknife estimator estimates the uncondi- 
tional PMSE over the joint distribution of the ran- 
dom effects and the responses. Lohr and Rao (2009) 
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proposed a modification of the jackknife, which is 
simpler and estimates the conditional PMSE, 

E [(Pi -Pi) 2 \yi\- Denoting qi(ip,yi)=Vax(pi\yi;il)), 
the modification consists of replacing Mu in (6.3) by 
Mu :C = qi($,yi) - TA^ihi^-hVi) ~ ?iW>>J/i)]- The 
modified estimator = Mu,c + M 2 i has bias of 
order o p (l/m) in estimating the conditional PMSE 
and bias of order o(l/m) in estimating the uncon- 
ditional PMSE. 

Hall and Maiti (2006) propose estimating the 
PMSE by use of double-bootstrap. For model (5.5), 
the procedure consists of the following steps: 

(1) Generate a new population from the model 
(5.5) with parameters ip and compute the "true" 
area proportions for this population. Compute the 
EBPs based on new sample data and newly esti- 
mated parameters. The new population and sample 
use the same covariates as the original population 
and sample. Repeat the process independently B\ 
times, with B\ sufficiently large. Denote by Pi^ii^) 

and pf^'^i'i'bi) the "true" proportions and corre- 
sponding EBPs for population and sample b\, b± = 
1,...,B±. Compute the first-step bootstrap PMSE 
estimator, 



(6.4) 



\-Y^ : 



Bi 



(2) For each sample drawn in Step (1), repeat the 
computations of Step (1) B2 times with B 2 suffi- 
ciently large, yielding new "true" proportions 

Pifoi^h) and EBPs p> k '{ipb 2 ), b 2 = l,..-,B 2 . Com- 
pute the second-step bootstrap PMSE estimator, 

Si 



\BS 



1 1 1 

— Y — 

B, ^ Bo 



(6.5) 



B 2 



■ E^S? P) (^ 



■Pi.fcWl)] • 



62=1 



The double-bootstrap PMSE estimator is obtained 
by computing one of the classical bias corrected es- 
timators. For example, 

r*?r+(*g-*g). 

ifA^>ABf, 
A^exp[(A^-AB|)/AB|], 

ifA^<ABf. 

Notice that whereas the first-step bootstrap esti- 
mator (6.4) has bias of order 0(l/m), the double- 



(6.6) A 



D-BS 



bootstrap estimator has bias of order o(l/m) under 
some regularity conditions. 

Pfeffermann and Correa (2012) develop a general 
method of bias correction, which models the error 
of a target estimator as a function of the corre- 
sponding bootstrap estimator, and the original esti- 
mators and bootstrap estimators of the parameters 
governing the model fitted to the sample data. This 
is achieved by drawing at random a large number 
of plausible parameters governing the model, gener- 
ating a pseudo original sample for each parameter 
and bootstrap samples for each pseudo sample, and 
then searching by a cross validation procedure the 
best functional relationship among a set of eligible 
bias correction functions that includes the classical 
bootstrap bias corrections. The use of this method 
produces estimators with bias of correct order and 
under certain conditions it also permits estimating 
the MSE of the bias corrected estimator. Applica- 
tion of the method for estimating the PMSE under 
the model (5.5) in an extensive simulation study out- 
performs the double-bootstrap and jackknife pro- 
cedures, with good performance in estimating the 
MSE of the PMSE estimators. 

Remark 3. All the resampling methods consid- 
ered above are in fact model dependent since they 
require computing repeatedly the empirical best pre- 
dictors under the model. 

Chambers, Chandra and Tzavidis (2011) develop 
conditional bias-robust PMSE estimators for the case 
where the small area estimators can be expressed as 
weighted sums of sample values. The authors as- 
sume that for unit j 6 Ui, yj = x'-/3j + e^; E(ej) = 0, 
Var(ej) = cr|, j = 1, . . . , nj, with /3j taken as a fixed 
vector of coefficients, and consider linear estimators 
of the form 0i = Ylkes w ikyk with fixed weights w^. 
Thus, if 0i defines the true area mean, 



Bias; = E(0i - 6$ 



(6.7) 



E Y W i3^jPh ~ Xifc, 



\h=lj£s h 



Var; = Var(0i - 0< 



n: 



where r j = Ui — Si and ct,ij = NiWij — I(j 6 Ui), with 
/(■) defining the indicator function. Assuming that 
for j € U, fij = E(yj\xj) = x^/3; is estimated as fij = 

x 'jfii = ^kes&kjyk and cr| = c 2 , the bias and vari- 
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ance in (6.7) are estimated as 

Bias; =1^2 Wijfij J - Nr 1 ^ fa, 
\h=ijes h ) jeiii 

(6.8) V f x i = Nr 2 Y^[aZ j + (N i -n i )nr 1 ] 
jes 

■^(yj-fa) 2 , 

where Xj = (1 - c^-) 2 + T,ke ? (-p) $fcj» and s ( _ j) de ~ 
fines the sample without unit j. 

The authors apply the procedure for estimating 
the PMSE of the EBLUP and the MBDE estima- 
tor (4.11) under model (5.3), and for estimating the 
PMSE of the M-quantile estimator defined in Sec- 
tion 6.6. For the first two applications the authors 
condition on the model variance estimators so that 
the PMSE estimators do not have bias of desired or- 
der even under correct model specification. On the 
other hand, the estimators are shown empirically 
to have smaller bias than the traditional PMSE es- 
timators in the presence of outlying observations, 
although with larger MSEs than the traditional es- 
timators in the case of small area sample sizes. 

6.2 Computation of Prediction Intervals 

As in other statistical applications, very often ana- 
lysts are interested in prediction intervals for the un- 
known area characteristics. Construction of predic- 
tion intervals under the Bayesian approach, known 
as credibility intervals, is straightforward via the pos- 
terior distribution of the predictor. A "natural" pre- 
diction interval under the frequentist approach with 
desired coverage rate (1 — a) is o\ ± Q ,/ 2 [Var(^ ' — 
6»i)] 1/2 , where #[ } is the EB, EBP or EBLUP pre- 
dictor, and Var(#j ' — 6i) is an appropriate estimate 
of the prediction error variance. However, even un- 
der asymptotic normality of the prediction error, the 
use of this prediction interval has coverage error of 
order 0(l/m), which is not sufficiently accurate. Re- 
cent work in SAE focuses therefore on reducing the 
coverage error via parametric bootstrap. 

Hall and Maiti (2006) consider the following gen- 
eral model: for a suitable smooth function fi(/3) of 
the covariates Xj = (xji, . . . ,Xj n J in area i and a vec- 
tor parameter j3, random variables 0; = fi{(3) + ui; 
E(ui) = are drawn from a distribution Q{fi((3); £}. 
The outcome observations yij are drawn indepen- 
dently from a distribution i?{Z(0j); r]i}, where /(•) is 
a known link function, and r\i is either known or is 



the same for every area i. For given covariates Xio, 
sample size n;o and known parameters, an a-level 
prediction interval for the corresponding realization 
@i0 is 

(6.9) I a (J3,£) = [q {1 - a)/ 2(^0,Q(i+a)/2(P,0i 

where <fo(/3,£) defines the a-level quantile of the 
distribution Q{fi(l3);£}. A naive prediction inter- 
val with estimated parameters is I a (j3,£), but this 
interval has coverage error of order 0(l/m), and it 
does not use the area-specific outcome values. To re- 
duce the error, I a ($, £) is calibrated on a. This is im- 
plemented by generating parametric bootstrap sam- 
ples and re-estimating /3 and £ similarly to the first 
step of the double-bootstrap procedure for PMSE 
estimation described in Section 6.1. Denote by J* = 
J a (/3*,£*) the bootstrap interval, and let a denote 
the solution of the equation Pr(#? 6 IV) = a, where 

®t ~ Q{fiip) >£}• The bootstrap-calibrated predic- 
tion interval with coverage error of order 0(m~ 2 ) is 

Chatterjee, Lahiri and Li (2008) consider the gen- 
eral linear mixed model of Das, Jiang and Rao (2004), 
mentioned in Section 6.1: Y = X/3 + Zu + e, where 
Y (of dimension n) signifies all the observations in 
all the areas, X nxp and Z nxq are known matrices 
and u and e are independent vector normal errors 
of random effects and residual terms with variance 
matrices D(ip) and R(i/j), which are functions of a 
fc-vector parameter ip. Note that this model and the 
model of Hall and Maiti (2006) include as special 
cases the mixed linear models defined by (5.1) and 
(5.3). The present model cannot handle nonlinear 
mixed models [e.g., the GLMM (5.5)], which the 
Hall and Maiti model can, but it does not require 
conditional independence of the observations given 
the random effects, as under the Hall and Maiti 
model. 

The (parametric bootstrap) prediction interval of 
Chatterjee, Lahiri and Li (2008) for a univariate 
linear combination t = c'(X/3 + Zu) is obtained by 
the following steps. First compute the conditional 
mean, ut and variance er 2 of t\Y;/3,ip. Next gener- 
ate new observations y* = X/3 + Zu* + e* , where 
u* ~ N(0,D(4>)), e* ~ N(0,R(ip)). From y* , esti- 
mate (3* and ip* using the same method as for f3 
and i/>, and compute jtf. and of (same as fit and 
at, but with estimated parameters). Denote by L* 
the bootstrap distribution of (of ) (i* — /t?)i where 
t* = c '(Xf3 + Zu*), and let d=(p + k) be the total 
number of unknown parameters. Then as d 2 /n — > 



12 



D. PFEFFERMANN 



and under some regularity conditions, if gi, q 2 sat- 
isfy L* n (q 2 ) - L* n ( qi ) = 1 - a, 

Pr(p4 + qi& t <t<pL t + q-iot) 

(6.10) 

= l-a + 0{d 3 n- 3 / 2 ). 

Note that this theory allows d to grow with n and 
that the coverage error is defined in terms of n rather 
than m, the number of sampled under the 

Hall and Maiti (2006) approach. The total sample 
size increases also as the sample sizes within the 
areas increase, and not just by increasing m. By 
appropriate choice of t, the interval (6.10) is area 
specific. 

Remark 4. The article by Chatterjee, Lahiri 
and Li (2008) contains a thorough review of many 
other prediction intervals proposed in the literature. 

6.3 Benchmarking 

Model-based SAE depends on models that can 
be hard to validate and if the model is misspeci- 
fied, the resulting predictors may perform poorly. 
Benchmarking robustifies the inference by forcing 
the model-based predictors to agree with a design- 
based estimator for an aggregate of the areas for 
which the design-based estimator is reliable. Assum- 
ing that the aggregation contains all the areas, the 
benchmarking equation takes the general form, 



(6.11) ^ Mi.model = Mi, 



design' 



i=l 



i=l 



The coefficients {bi} are fixed weights, assumed with- 
out loss of generality to sum to 1 (e.g., relative area 
sizes). Constraint (6.9) has the further advantage of 
guaranteeing consistency of publication between the 
model-based small area predictors and the design- 
based estimator for the aggregated area, which is 
often required by statistical bureaus. For example, 
the model-based predictors of total unemployment 
in counties should add up to the design-based esti- 
mate of total unemployment in the country, which 
is deemed accurate. 

A benchmarking method in common use, often 
referred to as ratio or pro-rata adjustment, is 



(6.12) 



%Ratio ~~ ( fojff?,design / foj^j,model 
\j=l ' j=l / 



'i. model' 



precision of the model-based predictors before bench- 
marking. As a result, the prorated predictor in a 
given area is not consistent as the sample size in 
that area increases. Additionally, estimation of the 
PMSE of the prorated predictors is not straight- 
forward. Consequently, other procedures have been 
proposed in the literature. 

Wang, Fuller and Qu (2008) derive benchmarked 
BLUP (BBLUP) under the area level model (5.1) 
as the predictors minimizing ^™=\ ¥iE{6i — ^ ench ) 2 
subject to (6.11), where the ipiS are chosen positive 
weights. The BBLUP is 



nbench _ ziBLUP 
\BLUP — p i,modcl 



(6.13) 



The use of this procedure, however, applies the same 
ratio correction for all the areas, irrespective of the 



+ $i design Qj, model ) i 

3=1 

When the variance a 2 , is unknown, it is replaced by 
its estimator everywhere in (6.13), yielding the em- 
pirical BBLUP. You and Rao (2002) achieve "auto- 
matic benchmarking" for the unit level model (5.3) 
by changing the estimator of f3. Wang, Fuller and 
Qu (2008) consider a similar procedure for the area 
level model. Alternatively, the authors propose to 
augment the covariates x^ to x^ = (xj, biG 2 Di ). (The 
variances a 2 Di are considered known under the area 
level model.) The use of the augmented model yields 
a BLUP that likewise satisfies the benchmark con- 
straint (6.11) and is more robust to omission of an 
important covariate from Xj , provided that the miss- 
ing covariate is sufficiently correlated with the added 
covariate in Xj. 

Pfeffermann and Tiller (2006) add monthly bench- 
mark constraints of the form (6.11) to the measure- 
ment (observation) equation of a time series state- 
space model fitted jointly to the direct estimates in 
several areas. Adding benchmark constraints to time 
series models is particularly important since time 
series models are slow to adapt to abrupt changes. 
The benchmarked predictor obtained under the aug- 
mented time series model belongs to the family of 
predictors (6.13) proposed by Wang, Fuller and Qu 
(2008). By adding the constraints to the model equa- 
tions, the use of this approach permits estimating 
the variance of the benchmarked estimators as part 
of the model fitting. The variance accounts for the 
variances of the model error terms, the variances and 
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autocovariances of the sampling errors of the direct 
estimators and of the benchmarks YliLi Misdirect > 
t = 1,2, ... , and the cross-covariances and autoco- 
variances between the sampling errors of the direct 
estimators and the benchmarks. 

Datta et al. (2011) develop Bayesian benchmark- 
ing by minimizing 



bench \ 2 



l^design] S.t. 



(6.14) 



i=l 



design ; 



i=l 



i=l 



where design = (^.design, ■ • • , ^m.dcsign)'- The solution 
of this minimization problem is the same as (6.13), 
but with ^^odei replaced everywhere by the pos- 
terior mean #fc,Bayes- Denote the resulting predic- 
tors by ^g 1 ^ 1 . The use of these predictors has the 
drawback of "over shrinkage" in the sense that 

Ei=l M^B^yes ~ ^6,Bayes) 2 < E™1 ~ hf\ 

-bench, 1 "bench 1 ~~ 

^design], Where 6)Ba ye S = YT=l & <^Bayc' S and 9 b = 

Ei=iMi- To deal with this problem, Datta et al. 
(2011) propose to consider instead the predictors 
^Bayes 2 ' satisfying the constraints 

m m 

MjBave's = ^ Mi, design! 



(6.15) 



i=l 



i=l 



i=l 



gbench,2 
i,Baycs 



E 6 

i=i 



design 



where H = ET=i b i E l( e i ~ &b) 2 \0 dcsim }- The bench- 
marked predictors have now the form 



nbench,2 
i,Bayes 



^ bi@i, design 
i=l 



(6.16) +A CB (6 l 

,Bayes 

m 

A 2 CB = H/j2bi(kB 



Ob 



ayes ) > 



ayes 



9b 



ayes J 



i=l 



Nandram and Sayit (2011) likewise consider Bayes- 
ian benchmarking, focusing on estimation of area 
proportions. Denoting by c; the number of sample 
units in area i having characteristic C, and by pi the 
probability to have this characteristic, the authors 
assume the beta-binomial hierarchical Bayesian model, 

Ci\pi ~ Binomial (n*, pi); 

(6.17) Pi|/i,r ~ Beta[/ir, (1 — fj)r], i = l,...,m, 



Notice that the development of the Bayesian bench- 
marked predictors is general and not restricted to 
any particular model. The PMSE of the benchmarked 



predictor can be estimated as Stl^B^f — Oi 

bench, 2 f\ \2 



( _i 



csignj 



Var(0j 



j,Bayes | ['design 

) + (A,Bayes ~ ^.BayesJ , 

noting that the cross-product E[(9 iB n ^ ' 



5 ayes 



$i,Bayes) (^i,Baycs ^i)|^design] — 0- 



p(/x,r) = (l + r 2 



0<^< l,r>0. 



Let bi = rii/n. The benchmark constraint is defined 
as, 



(6.18) jH&iPi = 0; 0~Beta[/i O To,(l 



i=l 



The authors derive the joint posterior distribution 
of the true probabilities {pi,i = 1, . . . , m} under the 
unrestricted model (6.17), and the restricted model 
with (6.18), and prove that it is proper. Computa- 
tional details are given. Different scenarios are con- 
sidered regarding the prior distribution of 9. Un- 
der the first scenario To — > oo, implying that 6 is a 
point mass at /xq, assumed to be known. Under a 
second scenario fio and tq are specified by the an- 
alyst. In a third scenario fio = 0.5, to = 2, implying 
8 ~ Uniform(0, 1) (noninformative prior). Theoreti- 
cal arguments and empirical results show that the 
largest gain from using the restricted model is under 
the first scenario where 9 is completely specified, fol- 
lowed by the second scenario with To 2> 2. No gain 
in precision occurs under the third scenario with a 
noninformative prior. 

To complete this section, I mention a different fre- 
quentist benchmarking procedure applied by Ugarte, 
Militino and Goicoa (2009). By this procedure, the 
small area predictors in sampled and nonsampled 
areas under the unit level model (5.3) are bench- 
marked to a synthetic estimator for a region com- 
posed of the areas as obtained under a linear re- 
gression model with heterogeneous variances (but no 
random effects). The benchmarked predictors mini- 
mize a weighted residual sum of squares (WRSS) un- 
der model (5.3) among all the predictors satisfying 
the benchmark constraint. Notice that the predic- 
tors minimizing the WRSS without the constraint 
are the optimal predictors (5.4). For known vari- 
ances the benchmarked predictors are linear, but in 
practice the variances are replaced by sample es- 
timates. The authors estimate the PMSE of the re- 
sulting empirical benchmarked predictors by a single- 
step parametric bootstrap procedure. 
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6.4 Accounting for Measurement Errors in the 
Covariates 

Ybarra and Lohr (2008) consider the case where 
some or all the covariates Xj in the area level model 
(5.1) are unknown, and one uses an estimator Xj ob- 
tained from another independent survey, with 
MSE£)(xj) = Cj under the sampling design. (For 
known covariates x^i, Cki = 0.) Denoting the re- 
sulting predictor by 6f", it follows that for known 

(6.19) PMSE(#f rr ) = PMSE(0;) + (1 - n) 2 f3' C if 3 , 

where PMSE(^) is the PMSE if one knew x;. Thus, 
reporting PMSE(#j) in this case results in under- 
reporting the true PMSE. Moreover, if f3'Ci(3 > a 2 + 
a 2 Di , MSE(flf rr ) > a 2 Di = Var D (j/i). The authors pro- 
pose therefore to use instead the predictor 

ef e = liVi + (1-7^/3; 

(6.20) 

7 4 = {a\ + P'Ql3)/(a 2 Di + a 2 u + P'Ctf). 

The predictor 0^ minimizes the MSE of linear com- 
binations of \ji and x^/3. Additionally, E{6^ c — 9i) = 
(1 — 7i)[£ , z)(xi) — Xj]'/3, implying that the bias van- 
ishes if Xj is unbiased for Xj, and -E(#^ e — Oi) 2 = 
li^Di < °r>i- The authors develop estimators for a 2 
and /3, which are then substituted in (6.20) to obtain 
the corresponding empirical predictor. The PMSE of 
the empirical predictor is estimated using the jack- 
knife procedure of Jiang, Lahiri and Wan (2002), 
described in Section 6.1. 

Ghosh, Sinha and Kim (2006) and Torabi, Datta 
and Rao (2009) study a different situation of mea- 
surement errors. The authors assume that the true 
model is the unit level model (5.3) with a single co- 
variate Xi for all the units in the same area, but 
%i is not observed, and instead, different measure- 
ments obtained for different sampled units 
j € Sj. The sample consists therefore of the obser- 
vations {yij,Xij]i = 1, ... ,m, j = 1, • • • , rii}. An ex- 
ample giving rise to such a scenario is where Xi 
defines the true level of air pollution in the 
and the Xij's represent pollution measures at dif- 
ferent sites in the area. It is assumed that Xij = 
Xi + r]ij; Xi ~ N(fi x ,al), and !//,..:,,.//;,) are inde- 
pendent normally distributed random errors with 
zero means and variances a 2 , a 2 and a 2 , respec- 
tively. Since Xi is random, this kind of measurement 
error is called structural measurement error. The dif- 
ference between the two articles is that Ghosh, Sinha 
and Kim (2006) only use the observations {i/ij} for 



predicting the true area means Yj, whereas Torabi, 
Datta and Rao (2009) also use the sample observa- 
tions {x^}. 

Assuming that all the model parameters are known, 
the posterior distribution of the unobserved y-values 
in area i is multivariate normal, which under the ap- 
proach of Torabi, Datta and Rao (2009) yields the 
following Bayes predictor (also BLUP) for Y: 
~ B 

Y { =E{Yi\{yi j ,x ij ,j = \,...,m}) 

(6.21) = (1 - fiAi)yi + fiM(3o + Pxlh) 

+ fiAiixifi^Xi -fi x ), 
where fi = l- (nj/iVj), j xi = nia 2 (a 2 + riia 2 )' 1 and 
A = \niji\o 2 x o 2 + (ni(j 2 u + o- 2 )v i \- 1 o- 2 v i , with v { = 
(a 2 +ni(T 2 ). For large Ni and small (rij/TVj), the 

PMSE of yf is EKY* - Yi) 2 \ { Vij , Xij }] = A, [(3 2 a 2 x + 
a 2 — riij3 2 a x v i ]. Estimating the model parameters 
ijj = (/3q, (3i, fj, x , a 2 , a 2 , a 2 , a 2 ) by a method of mo- 
ments (MOM) proposed by Ghosh, Sinha and Kim 
(2006) and replacing them by their estimates yields 
the EB estimator, which is shown to be asymptoti- 

-EB 

cally optimal in the sense that m Y^ILi ^(Yi ~ 
-B „ 

Y { ) 2 ^0asm^oo. The PMSE of the EB predic- 
tor is estimated by a weighted jackknife procedure 
of Chen and Lahiri (2002). 

The Bayes predictor of Ghosh, Sinha and Kim 
(2006) has a similar structure to (6.21), but without 
the correction term fiAcj x ifi\{xi — fi x ), and with the 
shrinkage coefficient Ai replaced by Ai = [rii(/3 2 a 2 + 
°u) + a e] -1<T £ i n the other two terms. As noted above, 
the authors develop a MOM for estimating the un- 
known model parameters to obtain the EB predictor 
and prove its asymptotic optimality. They also de- 
velop an HB predictor with appropriate priors for 
all the parameters. The HB predictor and its PMSE 
are obtained by MCMC simulations. 

Ghosh and Sinha (2007) consider the same unit 
level model as above with sample observations {{yij}, 
{xij}), but assume that the true covariate xi is a 
fixed unknown parameter, which is known as func- 
tional measurement error. The work by Ybarra and 
Lohr (2008) reviewed before also assumes a func- 
tional measurement error, but considers the area 
level model. For known parameters and Xi, the Bayes 
predictor takes now the simple form 

$f = E(Y i \{y ij ,j = l,...,n i }) 

(6.22) = (1 - fiBi)yi + fiBiiPo + Pm); 
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A pseudo-Bayes predictor (PB) is obtained by sub- 
stituting the sample mean Xj for Xj in (6.22). A pseu- 
do-empirical Bayes predictor (PEB) is obtained by 
estimating all the other unknown model parameters 
by the MOM developed in Ghosh, Sinha and Kim 
(2006). The authors show the asymptotic optimal- 
ity of the PEB, m^ 1 YT=i E(Y BEB - Y BB ) 2 -> as 
m — > oo. 

Datta, Rao and Torabi (2010) propose to replace 
the estimator Xj of Xj by its maximum likelihood es- 
timator (MLE) under the model. The corresponding 
PB of Y{ (assuming that the other model parame- 
ters are known) is the same as the PB of Ghosh 
and Sinha (2007), but with Bi replaced by Bi = 
(riicrl + a\ + Pfa 2 ) _1 a 2 . A PEB predictor is ob- 
tained by replacing the model parameters by the 
MOM estimators developed in Ghosh, Sinha and 
Kim (2006), and it is shown to be asymptotically op- 
timal under the same optimality criterion as before. 
The PMSE of the PEB is estimated by the jack- 
knife procedures of Jiang, Lahiri and Wan (2002) 
described in Section 6.1 and the weighted jackknife 
procedure of Chen and Lahiri (2002). The authors 
report the results of a simulation study showing that 
their PEB predictor outperforms the PEB of Ghosh 
and Sinha (2007) in terms of PMSE. A modification 
to the predictor of Ybarra and Lohr (2008) is also 
proposed. 

6.5 Treatment of Outliers 

Bell and Huang (2006) consider the area level mod- 
el (5.1) from a Bayesian perspective, but assume 
that the random effect or the sampling error (but 
not both) have a nonstandardized Student's t^) dis- 
tribution. The t distribution is often used in statisti- 
cal modeling to account for possible outliers because 
of its long tails. One of the models considered by the 
authors is 

Ui\6i,al ~ N(0,6ial); 
(6.23) 5' 1 ~Gamma[A;/2,(/c-2)/2], 

which implies E(5i) = 1 and Ui\a 2 ~ i(fc)(0, cr 2 (k — 
2)/k). The coefficient <5j is distributed around 1, 
inflating or deflating the variance of Uj = 6{ — 
A large value 5i signals the existence of an outlying 
area mean The degrees of freedom parameter, 
k, is taken as known. Setting k = co is equivalent 
to assuming the model (5.1). The authors consider 



several possible (small) values for k in their applica- 
tion, but the choice of an appropriate value depends 
on data exploration. Alternatively, the authors as- 
sume model (6.23) for the sampling error (with 
a 2 Di instead of a 2 ), in which case it is assumed that 
Ui ~ N(0,a 2 ). The effect of assuming the model for 
the random effects is to push the small area predic- 
tor (the posterior mean) toward the direct estima- 
tor, whereas the effect of assuming the model for 
the sampling errors is to push the predictor toward 
the synthetic part. The use of either model is shown 
empirically to perform well in identifying outlying 
areas, but at present it is not clear how to choose 
between the two models. Huang and Bell (2006) ex- 
tend the approach to a bivariate area level model 
where two direct estimates are available for every 
area, with uncorrelated sampling errors but corre- 
lated random effects. This model handles a situation 
where estimates are obtained from two different sur- 
veys. 

Ghosh, Maiti and Roy (2008) likewise consider 
model (5.1) and follow the EB approach. The start- 
ing point in this study is that an outlying direct es- 
timate may arise either from a large sampling error 
or from an outlying random effect. The authors pro- 
pose therefore to replace the EB predictor obtained 
from (5.2) by the robust EB predictor, 

ef oh = m - (i - ii)Vi* G m - ^glsW- 1 ]; 

(6.24) 

y^Var^-x'^GLs), 

where /3ql S 1S the empirical GLS under the model 
with estimated variance a 2 , and ^S>g is the Huber in- 
fluence function x I / g(^) = sig n (0 min(G, \ t\) for some 
value G > 0. Thus, for large positive standardized 
residuals (y, - x^^g)^" 1 , the EB §f B = yi - (1 - 
Ji)Vi(y~i — x i/3cLs)V; -1 under the model is replaced 
by 9f- oh = jji — (1— Ji)ViG, and similarly for large 
negative standardized residuals, whereas in other 
cases the ordinary EB, 0f B , is unchanged. The value 
G may change from one area to the other, and it 
is chosen adaptively in such a way that the excess 
Bayes risk under model (5.1) from using the pre- 
dictor (6.24) is bounded by some percentage point. 
Alternatively, G may be set to some constant 1 < 
Go < 2, as is often found in the robustness litera- 
ture. The authors derive the PMSE of 0?" ob under 
the model (5.1) for the case where a 2 is estimated 
by MLE with bias of order o(l/m), and develop an 
estimator for the PMSE that is correct up to the 
order O p (l/m). 
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Under the approach of Ghosh, Maiti and Roy 

(2008) , the EB predictor is replaced by the robust 
predictor (6.24), but the estimation of the unknown 
model parameters and the development of the PMSE 
and its estimator are under the original model (5.1), 
without accounting for possible outliers. Sinha and 
Rao (2009) propose to robustify also the estimation 
of the model parameters. The authors consider the 
mixed linear model (6.1), which when written com- 
pactly for all the observations y = (y[, . . . , y' m )' , has 
the form 

y = X/3 + Zu + e, 

(6.25) E(u) = 0, E(uu') = Q; 

E(e)=0, E(ee') = R, 

where u is the vector of random effects, and e is the 
vector of residuals or sampling errors. The matrices 
Q and R are block diagonal with elements that are 
functions of a vector parameter C = (&>••• > Cl) of 
variance components such that V{y) = V = ZQZ' + 
R = V(C). The target is to predict the linear com- 
bination t = l'/3 + h'u by t = I f3 + h'u. Under the 
model, the MLE of /3 and £ are obtained by solv- 
ing the normal equations X'V~ 1 (y — X/3) = 0; (y — 
XP)'V- x ^V-\y-XP) -MU- 1 ^) =0,1 = 1,..., 
L. To account for possible outliers, the authors pro- 
pose solving instead 

X'V~ 1 U 1/2 y G (r)=0; 
par 

(6.26) ^' G (r)U l / 2 V- l ^-V~ l U l /H G {r) 

-trfy-^clj =0, 1 = 1,..., L, 

where r = U~ l / 2 {y - X/3), U = Diag[V], 9 G (r) = 
[$> G (ri),fy G (r 2 ),. ■ .]' with ^ G (r k ) defining the Hu- 
ber influence function, I n is the identity matrix of or- 
der n and c = E[^%(r k )] [r k ~ N(0, 1)]. Notice that 
since Q and R are block diagonal, the normal equa- 
tions and the robust estimating equations can be 
written as sums over the m areas. 

Denote by /3R b> Cilob the solutions of (6.26). The 
random effects are predicted by solving 

Z'R-^alR^iy - X/3 Rob - Zu)} 

(6.27) 

-<r 1/2 * G (<r 1/2 u) = o, 

where R = -R(CRob), Q = Q(Cllob)- Sinha and Rao 

(2009) estimate the PMSE of the robust small area 
predictors by application of the first step of the 



double-bootstrap procedure of Hall and Maiti (2006) 
(equation 6.4). The parameter estimates and the 
predictors of the random effects needed for the appli- 
cation of the bootstrap procedure are computed by 
the robust estimating equations (6.26)-(6.27), but 
the generation of the bootstrap samples is under 
the original model with no outliers. The estimation 
of the PMSE can possibly be improved by generat- 
ing some outlying observations, thus reflecting more 
closely the properties of the original sample. 

6.6 Different Models and Estimators for Further 
Robustification 

In this section I review four different approaches 
proposed in the literature for further robustifica- 
tion of the inference by relaxing some of the model 
assumptions or using different estimators. All four 
studies focus on the commonly used area- level and /or 
unit-level models defined by (5.1) and (5.3), respec- 
tively. 

M-quantile estimation Classical model-based SAE 
methods model the expectations E(yi\xi,Ui) and 
E(ui). Chambers and Tzavidis (2006) and Tzavidis, 
Marchetti and Chambers (2010) propose modeling 
instead the quantiles of the distribution /(j/j|xj), 
where for now yi is a scalar. Assuming a linear model 
for the quantiles, this leads to a family of models in- 
dexed by the coefficient q € (0, 1); q = Pr[j/j < x£/? g ]. 
In quantile regression the vector j3 q is estimated by 
minimizing 

n 

^ i=i 

(6.28) ■[(l-q)I(y i -x' i P q <0) 

+ ql(y i -x' i p q >0)}}. 

M-quantile regression uses influence functions for es- 
timating /3q by solving the equations 

n 

J2^ q (r iq )xi = 0; 

i=l 

(6.29) 

V q (r iq ) = 2V{s- 1 r iq )[{l - q)I{r ig < 0) 

+ ql(r iq >0)}, 

where s is a robust estimate of scale, and V& is an ap- 
propriate influence function. The (unique) solution 
j3 q of (6.29) is obtained by an iterative reweighted 
least square algorithm. Note that each sample value 
(yi,Xi) lies on one and only one of the quantiles 
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m q (xi) = x'ifig (which follows from the fact that the 
quantiles are continuous in q). 

How is the M-quantile theory used for SAE? Sup- 
pose that the sample consists of unit level obser- 
vations {yij,Xij\i = l,...,m,j = l,...,ni}. Identify 
for unit the value qij such that y^-p qij =Uij- 

A predictor of the mean 9i in area i is obtained by 
averaging the quantiles q^ over the sampled units 
j £ Si and computing 




(6.30) 

rii 

3=1 

Alternatively, one can average the vector coefficients 
/3 gi .and replace /% in (6.30) by the mean j3 i = 

Y^jLi fiqiJ n i- The vectors pq. or P i account for dif- 
ferences between the areas, similarly to the random 
effects under the unit level model (5.3). 

The use of this approach is not restricted to the es- 
timation of means, although it does assume continu- 
ous y- values. For example, the distribution function 
in area i can be estimated as = 

N^lEj^ I(Vij <t) + E Hsi ItfJi < *)]• Cham- 
bers and Tzavidis (2006) develop unconditional and 
area specific estimators for the variance of the M- 
quantile estimators (6.30) assuming Pg. (or /3J is 
fixed, and estimators for the bias under the linear 
model E(y ij \x i j) =x' ij (3 i . 

The M-quantile approach does not assume a para- 
metric model, although it assumes that the quantiles 
are linear in the covariates in the theory outlined 
above. Clearly, if the unit level model (5.3) holds, 
the use of the model is more efficient, but the au- 
thors illustrate that the M-quantile estimators can 
be more robust to model misspecification. Notice in 
this regard that the approach is not restricted to 
a specific definition of the small areas. It accounts 
also for possible outliers by choosing an appropriate 
influence function in the estimating equation (6.29). 
On the other hand, there seems to be no obvious way 
of how to predict the means or other target quanti- 
ties for nonsampled areas. A possible simple solution 
would be to set q = 0.5 for such areas or weight the 
g-values of neighboring sampled areas, but it raises 
the question of how to estimate the corresponding 
PMSE, unless under a model. 

Use of penalized spline regression Another way 
of robustifying the inference is by use of penalized 



spline (P-spline) regression. The idea here is to avoid 
assuming a specific functional form for the expecta- 
tion of the response variable. Suppose that there is 
a single covariate x. The P-spline model assumes 
y = mo(x) + e, E(e) = 0, Var(e) = erf. The mean 
mo(x) is taken as unknown and approximated as 

m(x; p, 7) = /3 + Pix + ■ ■ ■ + p p x p 

K 

(6.31) +Y,lk{x - K k f + - 

k=l 

(x-K k f + =max[0,(x-K k )P], 

where p is the degree of the spline, and K\ < ■ • ■ < 
Kk are fixed knots. For large K and good spread of 
the knots over the range of x, spline (6.31) approxi- 
mates well most smooth functions. It uses the basis 
[1, x, ... , x p , (x — K\f + , . . . , (x — Kk)^ to approxi- 
mate the mean, but other bases can be considered, 
particularly when there are more covariates. 

Opsomer et al. (2008) use P-spline regression for 
SAE by treating the 7-coefficients in (6.31) as addi- 
tional random effects. Suppose that the data consist 
of unit-level observations, {yij,Xij;i = 1, . . . ,m, j = 
1, . . . ,rjj}. For unit j in area i, the model considered 
is 

Vij = Po + PiXij H h P v x\- 

(6.32) 

K 

+ ^lk(Xij ~ Kk)+ + Ui + Eij, 
fc=l 

where the UjS are the usual area random effects 
and EijS are the residuals. Let u = («i, . . . , u m )', 7 = 
(71, . . . , 7a:)'. Defining dij = 1 (0) if unit j is (is 
not) in area i and denoting dj = (dij, . . . , d m j)' and 
D = [d\, . . . , d n ]' , the model holding for the vector y 
of all the response values can be written compactly 
as 

y = XP + Z7 + Du + e; 
(6.33) 7^(0,^), 

u~(0,cr^I m ), e~(0,o-gl„), 

where X = [xf\. . . , xjfi]' with x[ p) = (1, x u . . . , xf )', 
and Z = [zi, z n )' with z t = [(ajj - K{f + , . . . , (xi - 
K K f + )}'. The model (6.33) looks similar to (6.25) 
but the responses yij are not independent between 
the areas because of the common random effects 7. 
Nonetheless, the BLUP and EBLUP of {P,u,i) can 
be obtained using standard results; see the article for 
the appropriate expressions. The small area EBLUP 



18 



D. PFEFFERMANN 



are obtained as 

^P-spline 
j, EBLUP 



P-spline -p'x^+iZi + Uil 



(6.34) 



» 



leu, 



The use of this approach requires that the co- 
variates are known for every element in the pop- 
ulation. Opsomer et al. (2008) derive the PMSE of 
the EBLUP (6.34) correct to second order for the 
case where the unknown variances are estimated by 
REML, and an estimator of the PMSE with bias 
correct to the same order. The authors develop also 
a nonparametric bootstrap algorithm for estimating 
the PMSE and for testing the hypotheses a\ = 
and (7^ = of no random effects. Rao, Sinha and 
Roknossadati (2009) use a similar model to (6.33), 
but rather than computing the EBLUP under the 
model, the authors propose predictors that are ro- 
bust to outliers, similar (but not the same) to the 
methodology developed by Sinha and Rao (2009) 
for the mixed linear model described in Section 6.5. 
Jiang, Nguyen and Rao (2010) show how to select an 
appropriate spline model by use of the fence method 
described in Section 8. 

Use of empirical likelihood in Bayesian inference 
Chaudhuri and Ghosh (2011) consider the use of em- 
pirical likelihoods (EL) instead of fully parametric 
likelihoods as another way of robustifying the infer- 
ence. When combined with appropriate proper pri- 
ors, it defines a semiparametric Bayesian approach, 
which can handle continuous and discrete outcomes 
in area- and unit-level models, without specifying 
the distribution of the outcomes as under the classi- 
cal Bayesian approach. Denote by 6 = (#i, . . . ,6 m )' 
and y = (y± , . . . , y m )' the area parameters and the 
corresponding direct estimators, and by r = (ti, . . . , 
T m ) the "jumps" defining the cumulative distribu- 
tion of yi, so that ^2=i r « = 1- The EL is Le = 
Y\iLi T i aR d for given moments E(yi\9i) = k(9i), 
V&r(yi\6i) = V(9i), the estimate f{0) is the solution 
of the constrained maximization problem 



max 

Tl,...,T m 



n 



r,;, 



s.t. Ti>o, yv, = 1, 



(6.35) 



[Vi - kW 
V(0i) 



1 



0. 



Under the area model (5.1) k(6i) = 8i = x^/3 + Uj 
and V(6i) = a 2 Di . The authors assume proper priors 
for (/3, ui, . . . , u m , a 2 ) and hence for 9, thus guaran- 
teeing that the posterior distribution Tr(6\y) is also 
proper. For given 8 the constrained maximization 
problem (6.35) is solved by standard methods (see 
the article) , and by combining the EL with the prior 
distributions, observations from the posterior distri- 
bution Tr(0\y) are obtained by MCMC simulations. 

For the unit-level model (5.3), E(yij\6ij) = k(9ij) = 
x-^/3 + Ui and Var(yjj|%) = U(%) = of. Denoting 
by the "jumps" of the cumulative distribution 
in area i, the EL is defined in this case as Le = 

i\ZiUf=i nj = i\Zi T (r)> and for § iven = 

(9n, di^J, f(j)(0) = [fn(9), . . . ,t irH (6)]' is the so- 
lution of the area specific maximization problem 



in 

max I I ■ 



(6.36) 



S.t. Tij>0, y^Tjj = 1, 

rii 

^Tijiyij -k(0ij)] = 0, 



j'=i 



[va - H&a)]' 



i=l 



The authors applied the procedure for estimating 
state-wise median income of four-person families in 
the USA, using the area-level model. Comparisons 
with the census values for the same year reveal much 
better predictions under the proposed approach com- 
pared to the direct survey estimates and the HB 
predictors obtained under normality of the direct 
estimates. 

Best predictive SAE In the three previous ap- 
proaches reviewed in this section, the intended ro- 
bustification is achieved by relaxing some of the model 
assumptions. Jiang, Nguyen and Rao (2011) pro- 
pose instead to change the estimation of the fixed 
model parameters. The idea is simple. In classical 
model-based SAE the EBLUP or EB predictors are 
obtained by replacing the parameters in the expres- 
sion of the BP by their MLE or REML estimators. 
Noting that in SAE the actual target is the predic- 
tion of the area means, and the estimation of model 
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parameters is just an intermediate step, the authors 
propose to estimate the fixed parameters in such a 
way that the resulting predictors are optimal under 
some loss function. 

Consider the area-level model (5.1) with normal 
errors, and suppose first that a 2 is known. Under 
the model, E{yi) = but suppose that the model 
is misspecified and E{yi) = fa, such that 9{ = fa + Ui, 
i = 1, ... ,m. Let 9{ be a predictor of 6i, and define 
the mean square prediction error to be MSPE(#) = 
^2iLiE(6i — 6i) 2 , where the expectation is under 
the correct model By (5.2), the MSPE of the BP 
for given /3 is MSPE[0(/3)] = E{£Zibm + (1 - 
7i)x^/3 — 6i] 2 }- The authors propose minimizing the 
expression inside the expectation with respect to 
(3, which is shown to be equivalent to minimizing 

£™i[(i - nfiAP? - 21X1(1 - T*)W y ield - 

ing the "best predictive estimator" (BPE) 



(6.37) p 



^(1 -7i) 2 Xi2/i. 



i=i 



Notice that unless Var£)(ej) = o- 2 Di = a 2 D , (3 differs 
from the commonly used GLS estimator under the 

model (5.1); /3 G LS = E™1 H^iY 1 YT=i li^iVi- The 
"observed best predictor" (OBP) of 0j is obtained 
by replacing /3gls by f3 in the BP (5.2) under the 
model (5.1). 

The authors derive also the BPE of ip = (/3',o- 2 )' 
for the case where a 2 is unknown, in which case 
the OBP is obtained by replacing a 2 and /3qls by 
the BPE of ip in (5.2). Another extension is for the 
unit level model (5.3), with the true area means 
and MSPE defined as 0* = % and MSPE[#(V>)] = 
YZiEd&W ~ Gi] 2 , respectively, where ij, = (/?', 
a 2 , a 2 )' and Ee>{-) is the design (randomization) ex- 
pectation over all possible sample selections (Sec- 
tion 4.1). The reason for using the design expecta- 
tion in this case is that it is almost free of model 
assumptions. Theoretical derivations and empirical 
studies using simulated data and a real data set il- 
lustrate that the OBP can outperform very signifi- 
cantly the EBLUP in terms of PMSE if the under- 
lying model is misspecified. The two predictors are 
shown to have similar PMSE under correct model 
specification. 

6.7 Prediction of Ordered Area Means 

Malinovsky and Rinott (2010) consider the follow- 
ing (hard) problem: predict the ordered area means 



■iji = fj, + Ui + ei = Qi + ei [special case of (5.1)], with 

itj H(0,a 2 ), ei G(0,a 2 ); H and G are general 
distributions with zero means and variances a 2 and 
a 2 . To illustrate the difference between the predic- 
tion of ordered and unordered means, consider the 
prediction of 6t m \ = maxj{#j}. If #j satisfies E(8i\8i) = 

6i,i = 1, . . . ,m, then i?[maxj{#j}|{6>j}] > 9r m -\ so that 
the largest estimator overestimates the true largest 
mean. On the other hand, the Bayesian predictors 
9* = E[6i\{dj}} satisfy E[maxi{9*}} < E{6 [m) ), an 
underestimation in expectation. 

Wright, Stern and Cressie (2003) considered the 
prediction of ordered means from a Bayesian per- 
spective, but their approach requires heavy numeri- 
cal calculations and is sensitive to the choice of pri- 
ors. Malinovsky and Rinott (2010) compare three 
predictors of the ordered means under the frequen- 
tist approach, using the loss function L{9r.\,9t.\) = 

£™i(%) " and the Bayes risk £[L(0q, ( .))]. 

Let 0i define the direct estimator of 0i and 9u^ the 
ith ordered direct estimator (statistic). The predic- 
tors compared are 



5(1) 



(6.38) 9®{5) = 6§® + (l-8)8, = ^/™; 



i=l 



The results below assume that a 2 and a 2 are known 
and that fi is estimated by 9. 

~\k] 

Denote by 9, J the predictor of the ordered means 



(k) 

when using the predictors 9^ , k = 1,2,3, and let 

7 = °u( a u + a e)~ 1 be the shrinkage coefficient when 
predicting the unordered means [equation (5.2)]. The 
authors derive several theoretical comparisons. For 
example, if 7 < (m — l) 2 / (m + l) 2 , then 



(6.39) 



E[L0®V),eQ)] 

<#[L(£H»0(.))] for all 7<(5< 1. 



Noting that lim m ^. 00 [(m - l) 2 /(m + l) 2 ] = 1, it fol- 
lows that (6.39) holds asymptotically for all 7, and 
the inequality 7 < 5 < 1 implies less shrinkage of the 
direct estimators toward the mean. In particular, the 

optimal choice of 5 for 6^ (5) satisfies linx^—^oo 



1/2 



'(1) 



< 



'(2) 



< 



< 



'(m) 



under the area-level model 



7 

The results above assume general distributions H 
and G. When these distributions are normal, then 
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for m = 2, E[L(9^, 6 (.j)] < ^[L^J 2 ]^),^.))] for all 
5. A conjecture supported by simulations is that this 
relationship holds also for m > 2. However, the sim- 
ulations suggest that for sufficiently large m (e.g., 

m > 25), flj 3 ] is efficiently replaced by #[ 2l (7 1/2 ). The 
last two conclusions are shown empirically to hold 
also in the case where a\ is unknown and replaced 
by the MOM variance estimator. 

Remark 5. The problem of predicting the or- 
dered means is different from ranking them, one 
of the famous triple- goal estimation objectives in 
SAE. The triple-goal estimation consists of produc- 
ing "good" area specific estimates, "good" estimates 
of the histogram (distribution) and "good" estimates 
of the ranks. See Rao (2003) for discussion. Judkins 
and Liu (2000) considered another related problem 
of estimating the range of the area means. The au- 
thors show theoretically and by simulations that the 
range of the direct estimators overestimates the true 
range, whereas the range of the empirical Bayes es- 
timators underestimates the true range, in line with 
the discussion at the beginning of this section. The 
bias is much reduced by use of a constrained em- 
pirical Bayes estimator. For the model considered 
by Malinovsky and Rinott (2010), the constrained 
estimator is obtained by replacing the shrinkage co- 
efficient 7 = o 2 u (al + al)' 1 in (5.2) by 7 ^ r y~ 1 ^ 2 , 
which again shrinkages less the direct estimator. 

6.8 New Developments for Specific Applications 

In this section I review two relatively new appli- 
cations of SAE; assessment of literacy and poverty 
mapping. The latter application, in particular, re- 
ceived considerable attention in recent years. 

Assessment of literacy The notable feature of as- 
sessing literacy from a literacy test is that the pos- 
sible outcome is either zero, indicating illiteracy, or 
a positive continuous score measuring the level of 
literacy. Another example of this kind of data is the 
consumption of illicit drugs, where the consumption 
is either zero or a continuous measure. In both ex- 
amples the zero scores are "structural" (true) ze- 
roes. The common models used for SAE are not ap- 
plicable for this kind of responses if the proportion 
of zeroes is high. Pfeffermann, Terryn and Moura 
(2008) consider the estimation of the average liter- 
acy score and the proportion of people with positive 
scores in districts and villages in Cambodia, a study 
sponsored by the UNESCO Institute for Statistics 
(UIS). Denote by yij k the test score of adult k from 



village j of district i and by r^ k a set of covariates 
and district and village random effects. The follow- 
ing relationship holds: 

E(Vijk\rijk) = E(y ijk \r ijk ,y ijk > 0) 

(6.40) 

• P*(yijk > 0\rijk)- 
The two parts in the right-hand side of (6.40) are 
modeled as E[y ijk \r ijk ,y ijk > 0] = x' ijk (3 + ut + v ij} 
where (uj, Vij) are district and nested village random 
effects, Pr(yij k > 0|r 



% 3 i -= Pijk] logit (pijfc) = 7'zj,fc + 
defines a set of covariates which 



u* +v*j, where z ijk 



may differ from Xij k and (u*,v*-) are district and 
nested village random effects, which are correlated 
respectively with (ui,Vij). The village and district 
predictors of the average score and the proportion 
of positive scores are obtained by application of the 
Bayesian approach with noninformative priors, us- 
ing MCMC simulations. The use of the Bayesian ap- 
proach enables one to account for the correlations 
between the respective random effects in the two 
models, which is not feasible when fitting the two 
models separately. The area predictors are obtained 
by imputing the responses for nonsampled individ- 
uals by sampling from their posterior distribution, 
and adding the imputed responses to the observed 
responses (when observations exist). 

Remark 6. Mohadjer et al. (2007) estimate the 
proportions 9y of adults in the lowest level of liter- 
acy in counties and states of the USA, by modeling 
the direct estimates pij in county j of state i as pij = 
9ij + £ij, and modeling logit(%) = + u\ + 
with m and defining state and county random 
effects. The state and county estimates are likewise 
obtained by MCMC simulations with noninforma- 
tive priors. Note that this is not a two-part model. 

Poverty mapping The estimation of poverty indi- 
cators in small regions is of major interest in many 
countries across the world, initiated and sponsored 
in many cases by the United Nations and the World 
Bank. In a celebrated article (awarded by the Cana- 
dian Statistical Society as the best paper published 
in 2010 in The Candian Journal of Statistics) , Molina 
and Rao focus on estimation of area means of non- 
linear poverty measures called FGT defined as 

1 N * 



(6.41) 



F, 



aij 



Ei 



x I(E i:j < z) 



a 



0,1,2, 
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where Eij is a measure of welfare for unit j in area i 
such as income or expenditure, z is a poverty thresh- 
old under which a person is considered "poor" (e.g., 
60% of the nation median income) and /(•) is the 
indicator function. For a = 0, F a i is the proportion 
under poverty. For a = 1 , F a i measures the "poverty 
gap," and for a = 2, F a { measures "poverty sever- 
ity." 

For a = 1,2 it is practically impossible to assign 
a distribution for the nie&sures Fadji cind in order 
to estimate the means F a i in sampled and nonsam- 
pled areas, Molina and Rao (2010) assume the ex- 
istence of a one-to-one transformation yij = T(Eij) 
such that the transformed outcomes y^ satisfy the 
unit level model (5.3) with normal distribution of 
the random effects and the residuals. Notice that 
F aij = [1 - \T- l { yij )] a x IlT-^yij) < z] =: h a ( yij ). 
For sampled units j G Sj F a ij is known, and for 
the nonsampled units k G rj, the missing measures 
are imputed by the EBP F™P = E[h a (y lk )\y s ] = 

YliLi ha{yfk )/L with large L, where y s defines all 

the observed outcomes. The predictions yf^ are ob- 
tained by Monte Carlo simulation from the condi- 
tional normal distribution of the unobserved out- 
comes given the observed outcomes under the model 
(5.3), using estimated parameters ij) = (/3', a^, of)'- 
The PMSE of the EBP F™P = Faij + 

Ylker Fafo P ]/Ni is estimated similarly to the first 
step of the double-bootstrap procedure described 
in Section 6.1. Model- and design-based simulations 
and application to a real data set from Spain using 
the transformation y^ = log(Eij) demonstrate good 
performance of the area predictors and the PMSE 
estimators. 

Remark 7. The World Bank (WB) is currently 
using a different method, under which all the popu- 
lation values y^ are simulated from model (5.3) with 
estimated parameters (including for sampled units), 
but with random effects for design clusters, which 
may be different from the small areas. As discussed 
and illustrated by Molina and Rao (2010), the use 
of this procedure means that all the areas are prac- 
tically considered as nonsampled, and the resulting 
predictors of the means F a i in (6.41) are in fact syn- 
thetic predictors since the random effects and the 
area means of the residuals cancel out over the L 
simulated populations. Simulation results in Molina 
and Rao (2010) show that the WB method produces 
predictors with much larger PMSE than the PMSE 
of the EBP predictors proposed by them. 



7. SAE UNDER INFORMATIVE SAMPLING 
AND NONRESPONSE 

All the studies reviewed in this paper assume, at 
least implicitly, that the selection of areas that are 
sampled and the sampling designs within the se- 
lected areas are noninformative, implying that the 
model assumed for the population values applies 
also to the sample data with no sampling bias. This, 
however, may not be the case, and as illustrated 
in the literature, ignoring the effects of informa- 
tive sampling may bias the inference quite severely. 
A similar problem is not missing at random (NMAR) 
nonresponse under which the response probabilities 
depend on the missing data, which again can bias 
the predictions if not accounted for properly. These 
problems received attention under both the frequen- 
tist and the Bayesian approaches. 

Pfeffermann and Sverchkov (2007) consider the 
problem of informative sampling of areas and within 
the areas. The basic idea in this article is to fit a 
sample model to the observed data and then ex- 
ploit the relationship between the sample model, 
the population model and the sample-complement 
model (the model holding for nonsampled units) in 
order to obtain unbiased predictors for the means in 
sampled and nonsampled areas. 

Consider a two-stage sampling design by which m 
out of M areas are selected in the first stage with 
probabilities 7Tj = Pr(z G s), and ni out of iVj units 
are sampled from the ith selected area with proba- 
bilities 7Tj|j = Pr(j G Si\i G s). Denote by Ij and Ijj 
the sample indicator variables for the two stages of 
sampling and by Wi = l/ni and wju = the first 

and second stage sampling weights. Suppose that 
the first level area random effects {u±, . . . ,um} are 
generated independently from a distribution with 
p.d.f. f p (ui), and that for given Ui the second level 
values {ya, ■ ■ ■ ,ViNi} are generated independently 
from a distribution with p.d.f. f p (yij\xij,Ui). The 
conditional first-level sample p.d.f. of U{, that is, the 
p.d.f. of Ui for area i G s is 

fs(Ui) d = f(Ui\Ii = 1) 

(7.1) = Pv(Ii = 1K)/ P K)/Pr(/; = 1) 

= E s (wi)f p (ui)/E s (wi\ui). 

The conditional first-level sample- complement p.d.f. 
of Ui, that is, the p.d.f. for area i ^ s is 

fc{Ui) d = f{Ui\Ii =0) 

(7.2) 

= ?T{I i = U\u i )f p {u i )/¥T(I i = G). 
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Note that the population, sample and sample- comple- 
ment p.d.f.s are the same if Pr(lj = l\u{) = Pr(ij = 
1), in which case the area selection is noninforma- 
tive. Similar relationships hold between the sam- 
ple p.d.f., population p.d.f. and sample-complement 
p.d.f. of the outcomes yij within the selected areas, 
for given values of the random effects. 

Pfeffermann and Sverchkov (2007) illustrate their 
approach by assuming that the sample model is the 
unit-level model (5.3) with normal random effects 
and residuals, and that the sampling weights within 
the selected areas have sample model expectations, 

EsixWjUpcij , yij , Itj, 1{ — 1) 

(7.3) = Esitwjiilxij^ijJi = 1) 
= A;iexp(a'xjj + byij), 

where fe, = N^Ui)' 1 Ylf=i exp(-a'x ij - - byij)/Ni, 
and o and b are fixed constants. No model is as- 
sumed for the relationship between the area selec- 
tion probabilities and the area means. The authors 
show that under this model and for given parame- 
ters {/?', b, a 2 , of }, the true mean Yj in sampled area 
% can be predicted as 

Yi = E p (Yi\D„Ii = i) 

(7.4) = -L{(JVi - + mivi + (Xt - XiYp] 

+ (Ni - ni )ba 2 e }, 

where D s represents all the known data and 6i = 
iii + Xifi is the optimal predictor of the sample model 
mean 0i = X-/3 + m. The last term in (7.4) corrects 
for the sample selection effect, that is, the differ- 
ence between the sample-complement expectation 
and the sample expectation in sampled areas. 

The mean Y& of area k not in the sample can be 
predicted as 



(7.5) 



E p (Y k \D s ,I k = 0) 
= X' k P + bal 



+ 



^2(wi - i)ui / ^2( Wi - 1) 



The last term of (7.5) corrects for the fact that the 
mean of the random effects in areas outside the sam- 
ple is different from zero under informative selection 
of the areas. The authors develop test procedures 
for testing the informativeness of the sample selec- 
tion and a bootstrap procedure for estimating the 



PMSE of the empirical predictors obtained by sub- 
stituting the unknown model parameters by sample 
estimates. The method is applied for predicting the 
mean body mass index (BMI) in counties of the USA 
using data from the third national health and nutri- 
tion examination survey (NHANES III). 

Malec, Davis and Cao (1999, hereafter MDC) and 
Nandram and Choi (2010, hereafter NC) likewise 
consider the estimation of county level BMI statis- 
tics from NHANES III, with both articles account- 
ing for within-area informative sampling in a simi- 
lar manner, and the latter article accounting, in ad- 
dition, for informative nonresponse. Another differ- 
ence between the two articles is that MDC consider 
binary population outcomes (overweight /normal sta- 
tus), with logistic probabilities that contain fixed 
and multivariate random area effects, whereas NC 
assume a log-normal distribution for the continu- 
ous BMI measurement, with linear spline regressions 
containing fixed and random area effects defining 
the means. In order to account for sampling effects, 
both articles assume that each sampled unit rep- 
resents K — 1 other units (not sampled) within a 
specified group (cluster) of units, with unit j se- 
lected with probability nf^ that can take one of the 
G observed values tt*, g = 1, . . . , G in that group. 
The groups are defined by county and demographic 
characteristics. Specifically, let 5j = 1 (0) if unit j is 
sampled (not sampled). The MDC model for a given 
group assumes 

Sj\K, 7r* i} ~ d Bernoulli^)), j = 1, . . . , K; 

Pr Ki) = n g\ 9 9v>Vj =V) = °9yi 
(7.6) y = 0,l;g = l,...,G, 

Pr(y j =y\p)= p y(l-p) 1 -y, 

0<p<l;p(K) = l. 

It follows that 



(7.7) 



oc 



'(./) 
p y(i- P ) 



— { T gy/^g)/ 



MDC show that the MLE of 9 gy is 9 

^2 g *=i( T g*y I Kg*) where T gy is the sample frequency 
of n* in the group for units with overweight status 
y. They plug the estimate into (7.7) and then into 
the full likelihood that includes also the distribution 
of random effects contained in a logit model for p. 
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NC generalize model (7.6) by allowing the out- 
come to be continuous, assuming Pr(-7r^ = ir*\ 

e g{y)iV) =Qg{y), -°o < y < oo where 6 9 {y) =6 g i for 
a i-i < V < a h an d replacing the Bernoulli distribu- 
tion for y by a continuous p.d.f. To account for infor- 
mative nonresponse, the authors assume that the re- 
sponse probabilities are logistic with logit(p^) = 
voi + vuyij, where {(voi,vu)} is another set of ran- 
dom effects having a bivariate normal distribution. 

Remark 8. As the notation suggests, both MDC 
and NC use the full Bayesian approach with appro- 
priate prior distributions to obtain the small area 
predictors under the respective models. See the ar- 
ticles for details. The authors do not consider infor- 
mative sampling of the areas. 

I conclude this section by describing an article 
by Zhang (2009), which uses a very different model 
from the other models considered in the present pa- 
per. The article considers the estimation of small 
area compositions in the presence of NMAR nonre- 
sponse. Compositions are the counts or proportions 
in categories of a categorical variable such as types 
of households, and estimates of the compositions 
are required for every area. Zhang deals with this 
problem by assuming that the generalized SPREE 
model (GSPREE) developed in Zhang and Cham- 
bers (2004) holds for the complete data (with no 
missingness). In order to account for the nonresponse, 
Zhang assumes that the probability to respond is 
logistic, with a fixed composition effect £ c and a 
random area effect b a as the explanatory variables. 
(Same probability for all the units in a given cell de- 
fined by area x category.) The model depends there- 
fore on two sets of random effects, one set for the un- 
derlying complete data, with a vector of correlated 
multivariate normal composition effects in each area 
defining the GSPREE model, and the other set for 
the response probabilities. Zhang (2009) estimates 
the small area compositions under the extended 
GSPREE using the EM algorithm, and estimates 
the PMSE under the model, accounting for the fixed 
and random effects estimation. The approach is ap- 
plied to a real data set from Norway. 

8. MODEL SELECTION AND CHECKING 

Model selection and checking is one of the major 
problems in SAE because the models usually contain 
unobservable random effects, with limited or no in- 
formation on their distribution. Notice that classical 



model selection criteria such as the AIC do not ap- 
ply straightforwardly to mixed models because they 
use the likelihood, which requires specification of 
the distribution of the random effects, and because 
of difficulties in determining the effective number 
of parameters. In what follows I review several re- 
cent studies devoted to model selection and valida- 
tion from both a frequentist and Bayesian perspec- 
tive. These should be considered as supplements to 
"ordinary" checking procedures based on graphical 
displays, significance testing, sensitivity of the com- 
puted predictors and their PMSEs to the choice of 
the likelihood and the prior distributions, and com- 
parison of the model-dependent predictors with the 
corresponding model free direct estimators in sam- 
pled areas. Such model evaluation procedures can 
be found in almost every article on SAE; see, for 
example, Mohadjer et al. (2007) and Nandram and 
Choi (2010) for recent diverse applications. 

Vaida and Blanchard (2005) study the use of the 
AIC assuming model (6.1) with Var(iij) = Q, 
Var(ej) = a 2 I ni . The authors distinguish between in- 
ference on the marginal model with focus on the 
fixed effects, and inference on the model operating 
in the small areas with the associated vector random 
effects Ui. For the first case, the model can be writ- 
ten as a regression model with correlated residuals: 
yi = Xtf + vf, Vi = Zim + e, ~ N(0, ZiQZ[ + a 2 I ni ). 
For this case, the classical (marginal) AIC, mAIC = 
— 21ogg(y|'0MLE) + 2-P applies, where y is the vector 
of all the observations, ff(y|t/ , MLE) is the marginal 
likelihood evaluated at the MLE of ip, the vector 
containing (3, a 2 and the unknown elements of Q 
and P = dim(^). Gurka (2006) validates by simula- 
tions that one can use also in this case the mAIC 
with "0REML i despite the use of different fixed effects 
design matrices under different models. 

For the case where the focus is the model operat- 
ing at the small areas, Vaida and Blanchard (2005) 
propose using a conditional AIC, which, for a given 
likelihood g(y\tp,u), is defined as 

cAIC = -21og 5 (y|^MLE,n) + 2P*; 

(8.1) 

_ n(n-k-l)(p+l)+n(k + l) 
(n-k)(n-k-2) 

where k is the number of covariates, u = E(u\^mle, y) 
is the EBP of u and p = tr(il) with H defining the 
matrix mapping the observed vector y into the fitted 
vector y = X/3 + Zu, such that y = Hy. Notice that 
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under this definition of the cAIC, the UjS are addi- 
tional parameters. A conditional AIC for the case 
where tp is estimated by REML is also developed. 
The article contains theoretical results on proper- 
ties of the cAIC and empirical results illustrating its 
good performance. The use of (8.1) is not restricted 
to mixed linear models with normal distributions of 
the error terms, and it can be used to select the 
design matrices Xj and Z%. 

Pan and Lin (2005) propose alternative goodness- 
of-fit test statistics for the GLMM, based on esti- 
mated cumulative sums of residuals. Utilizing the 
notation for model (6.1), the GLMM assumes the 
existence of a one-to-one link function <?(•), satis- 
fying g[E(yij\ui)] = x^/3 + z'^m, where x»j and 
are the rows of the matrices Xi and Zi correspond- 
ing to unit £ Sj. The unconditional predictor 
of yij is mij(ij)) = E(y i:j ) = E Ui [g- 1 {x' ij P + z^is*)], 

which is estimated by mjj(^), The estimated model 
residuals are therefore = — m^- (?/>), and they 
are computed by numerical integration. The authors 
consider two statistics based on the distributions of 
aggregates of the residuals, 

m rii 

W(x) = rT 1 ' 2 J ( X *J ^ x ) e *i' 

i=l j=l 

(8.2) 

m rii 

W g {r) = n- 1 ' 2 Yl ^ r )e»i > 

i=l j=i 

where I(xy < x) = nf=i I( x iji — x l)- I n particular, 
for testing the functional form of the Zth covari- 
ate, one may consider the process Wi(x) = n~ 1 / 2 ■ 
Yli=i Y^j=i Ifaijl < x ) e iji which is a special case 
of W(x). The authors develop a simple approxima- 
tion for the null distribution of Wi(x) and use it 
for visual inspection by plotting the observed values 
against realizations from the null distributions for 
different values of x, and for a formal test defined by 
the supremum Si = sup x \Wi(x)\. The statistic Si is 
used for testing the functional form of the determin- 
istic part of the model. To test the appropriateness 
of the link function, the authors follow similar steps, 
using the statistics W g (r) for visual inspection and 
S g = sup r \W g (r)\ for formal testing. As discussed 
in the article, although different tests are proposed 
for different parts of the model, each test actually 
checks the entire model, including the assumptions 
regarding the random components. 

The goodness-of-fit tests considered so far assume 
a given structure of the random effects, but are ran- 
dom effects actually needed in a SAE model ap- 



plied to a given data set? Datta, Hall and Man- 
dal (2011) show that if in fact the random effects 
are not needed and are removed from the model, it 
improves the precision of point and interval estima- 
tors. The authors assume the availability of k co- 
variates Xj = (xu, . . . , Xki), i = 1, . . . ,m (viewed ran- 
dom for the theoretical developments) and weighted 
area-level means = Y^Li^jVij] J2jLi w ij = 1 of 
the outcome with known weights and known sums 
W ir = YJjLi Wij,r = 2,...,q, q< k. The weights Wij 
are used for generating new area level means from 
bootstrap samples, and the sums Wi r are used for 
estimating model parameters by constructing appro- 
priate estimating equations. 

In order to test for the presence of random effects, 
the authors propose the test statistic 

m 

(8.3) T = Y,i W M^n~ l [m - Aitx^)] 2 , 
i=i 

where Xi(xi,ijj), 1 = 1,2 define the conditional mean 
and residual variance of y\x under the reduced model 
of no random effects, with estimated (remaining) pa- 
rameters tp. Critical values of the distribution of T 
under the null hypothesis of no random effects are 
obtained by generating bootstrap samples with new 
outcomes from the conditional distribution of y\x; ip 
for given (original) covariates and weights, and com- 
puting the test statistic for each sample. Empirical 
results indicate good powers of the proposed proce- 
dure and reduction in PMSE when the null hypoth- 
esis is not rejected. The procedure is applicable to 
very general models. 

Jiang et al. (2008) propose a class of strategies for 
mixed model selection called fence methods, which 
apply to LMM and GLMM. The strategies involve 
a procedure to isolate a subgroup of correct mod- 
els, and then select the optimal model from this 
subgroup according to some criterion. Let Qm = 
Qm(u,iPm) define a measure of "lack of fit" of a 
candidate model M with parameters tpM, such that 
E{Qm) is minimized when M is the true model. 
Examples of Qm are minus the loglikelihood or the 
residual sum of squares. Define Qm = Qm{u^m) = 
mi^ M£ ^ M Qm{v,iPm), and let M G M be such that 
Q M = minMeM Q M where M represents the set of 
candidate models. It is shown that under certain 
conditions, M is a correct model with probability 
tending to one. In practice, there can be more than 
one correct model and a second step of the proposed 
procedure is to select an optimal model among mod- 
els that are within a fence around Q M - Examples of 
optimality criteria are minimal dimension or mini- 
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mum PMSE. The fence is defined as Qm < Qm + 
c n a M M , where a M M is an estimate of the standard 

deviation of Qm — Qm> an< ^ Cn 1S a tuning coeffi- 
cient that increases with the total sample size. Jiang 
et al. (2008) discuss alternative ways of computing 
a M M and propose an adaptive procedure for choos- 
ing the tuning coefficient. The procedure consists 
of parametric bootstrapping new samples from the 
"full" model, computing for every candidate model 
MgM the empirical proportion p*(M, c n ) that it is 
selected by the fence method with a given c n , com- 
puting p*(c n ) = maxM<=MP*(M, c n ) and choosing c n 
that maximizes p*(c n ). 

Jiang et al. (2008) apply the method for select- 
ing the covariates in the area-level model (5.1) and 
the unit level model (5.3). Jiang, Nguyen and Rao 
(2010) apply the method for selecting nonparamet- 
ric P-spline models of the form (6.31). Selecting a 
model in this case requires selecting the degree of 
the spline p, the number of knots K and a smooth- 
ing parameter A used for estimation of the model 
parameters. 

So far I have considered model selection and di- 
agnostic procedures under the frequentist approach, 
but sound model checking is obviously required also 
under the Bayesian approach. Although this arti- 
cle is concerned with new developments, it is worth 
starting with a simulation procedure proposed by 
Dey et al. (1998) since it highlights a possible ad- 
vantage of the Bayesian approach in model checking. 
Let d define a discrepancy measure between the as- 
sumed model and the data, such as minus the first- 
stage likelihood of a hierarchical model. Denote by 
7/ b s the observed data and assume an informative 
prior. The procedure consists of generating a large 

(r) 

number R of new data sets y obs > r = 1, . . . , R under 
the presumed model via Monte Carlo simulations 
and comparing the posterior distribution of d|y bs 
with the distributions of d\y ^ s . Specifically, for each 
posterior distribution f(d\y^ s ) compute the vector 
of quantiles = qa}, ■ ■ ■ , qag (say a± = 0.025, . . . , 
qq = 0.975), compute q = Yl^=i /R and the Eu- 
clidean distances between q^ and q, and check 
whether the distance of the quantiles of the distri- 
bution of d|y bs from q is smaller or larger than, say, 
the 95th percentile of the R distances. 

Remark 9. The procedure is computationally 
intensive, and it requires informative priors to al- 
low generating new data sets, but it is very flexible 



in terms of the models tested and the discrepancy 
measure (s) used. A frequentist analog via paramet- 
ric bootstrap would require that the distribution of 
d does not depend on the model parameters, or that 
the sample sizes are sufficiently large to permit ig- 
noring parameter estimation. 

Bayarri and Castellanos (2007) investigate Bayes- 
ian methods for objective model checking, which re- 
quires noninformative priors for the parameters ip. 
The authors assume a given diagnostic statistic T 
(not a function of tp) and consider two "surprise 
measures" of conflict between the observed data and 
the presumed model; the p- value Pv h ^[T(y) > 
t(y bs)}, and the relative predictive surprise RPS = 
h[t(y Q \ ]S )]/ sup t [h(t)], where h{t) is some specified 
distribution. Denote by 9 the small area parameters. 
Writing f(y) = j f(y\9)g(9)d9, it is clear that defin- 
ing h(t) requires integrating 9 out of f(y\9) with 
respect to some distribution for 9. The prior g{9) 
cannot be used since it is also improper and the au- 
thors consider three alternative solutions: 1. Set the 
model hyper-parameters ift at their estimated value 
and integrate with respect to g(9\ip). This is basi- 
cally an application of empirical Bayes and /i EB (t) = 
f f(t\9)g(9\4>)d9. 2. Integrate 9 out by use of the 
posterior distribution g{9\y Q \ iS ). 3. Noticing that un- 
der the above two solutions, the data are used both 
for obtaining a proper distribution for 9 and for 
computing the statistic i(y bs)> the third solution re- 
moves the information in i(?/obs) from y Q b s by using 
the conditional likelihood /(y bs|iobs 5 9). The result- 
ing posterior distribution for 9 is then used to ob- 
tain the distribution h(t), similarly to the previous 
cases. The specified distribution h(t) under all three 
cases may not have a closed form, in which case it 
is approximated by MCMC simulations. See the ar- 
ticle for details and for illustrations of the approach 
showing, in general, the best performance under the 
third solution. 

Yan and Sedransk (2007) consider a specific model 
inadequacy, namely, fitting models that do not ac- 
count for all the hierarchical structure present, and, 
like the last article, restrict to noninformative pri- 
ors. The authors consider two testing procedures, 
both based on the predictive posterior distribution 
f(y\y hs) = f f(yWp(^\y bs)dip, where y and y obs 
are assumed to be independent given ijj. The first 
procedure uses the posterior predictive p- values, pij = 
P^(yij < yij\yobs)- The second procedure uses the p- 
values of a diagnostic statistic t(-) or a discrepancy 
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measure d(-) (see above), for example, the p- values 
Pr[i(y) > i(y bs)|2/obs]- The authors analyze the sim- 
ple case of a balanced sample where the fitted model 

is yijlfj,,^ 1 ^' N(fjL,(f)), i = l,...,m, j = l,...,n . It 
is shown that if the model is correct, then as N = 
uqiji — >• oo the distributions of y Q bs and y|y bs are the 
same, and the p- values ptj are distributed uniformly, 
as revealed in a Q-Q plot. On the other hand, if 

the true model is the two-level model Uij\0i,(f>a 

N(6i,4> ), 9i\fi ,A 1A ~' N(p ,A ), then as N ->• oo 
the mean and variance of the two models still agree, 
but not the covariances, so that it is the ensem- 
ble of the pijS or their Q-Q plot against the uni- 
form distribution, but not individual p-values, that 
permits distinguishing the two models. This, how- 
ever, is only effective if the intra-cluster correlation 
is sufficiently high, and the number of areas suffi- 
ciently small. Similar conclusions hold when com- 
paring a two-stage hierarchical model with a three- 
stage model, and when applying the second testing 
procedure with the classical ANOVA F test statistic 
as the diagnostic statistic, that is, when computing 
Pr[F(y)>F(y obs )|y obs ]. 

Yan and Sedransk (2010) consider a third proce- 
dure for detecting a missing hierarchical structure, 
which uses Q-Q plots of the predictive standardized 
residuals ru = F w J T^f' J against the standard 

[Var(i/jj |y bs j\ 

normal distribution. The conditions under which the 
procedure performs well in detecting a misspecified 
hierarchy are the same as above. 

Finally, I like to mention two articles that in a cer- 
tain way bridge between the frequentist and Bayesian 
approaches for model selection. The idea here is to 
set up a noninformative prior under the Bayesian 
approach so that the resulting posterior small area 
predictors have acceptable properties under the fre- 
quentist approach. This provides frequentist vali- 
dation to the Bayesian methodology, and the an- 
alyst may then take advantage of the flexibility of 
Bayesian inference by drawing observations from the 
posterior distribution of the area parameters. Both 
articles consider the area-level model (5.1), but the 
idea applies to other models. 

Datta, Rao and Smith (2005) assume a flat prior 
for j3 and seek a prior p(a 2 ) satisfying ^(V^hb) = 
PMSE[4(<%r E )] + o{m~ l ), where V mB = 
Var(#j|y bs) is the posterior variance of 6i, and 
PMSE[^(o^ RE )] is the frequentist PMSE of the 
EBLUP (or EB) when estimating a 2 u by REML. 
The expectation and PMSE are computed under the 



joint distribution of 9 and y under the model. The 
unique prior satisfying this requirement is shown to 
be 



(8.4) Pi {crl) oc (.r&i + ol? EtV^D, + O 

i=i 

The prior is area specific in the sense that different 
priors are required for different 

Ganesh and Lahiri (2008) extend the condition of 
Datta, Rao and Smith (2005) to a weighted combi- 
nation of the posterior expectations and the PMSEs, 
thus obtaining a single prior for all the areas. The 
authors seek a prior which for a given set of weights 
{ijji} satisfies 

m 

X>{#(^hb) - PMSE^^be)]} 
i=i 

(8.5) 

= o(l/m). 

The prior p(a^) satisfying (8.5) is shown to be 



2\2] 



p(*3)«X;[i/(oS m +*2) 



(8.6) 



i=l 



£^/(^+^)] 2 - 



i=l 



By appropriate choice of the weights {wj}, prior 
(8.6) contains as special cases the flat prior p(cr 2 ) = 
U{0, oo), the prior developed by Datta, Rao and 
Smith (2005) for a given area and the average mo- 
ment matching prior (obtained by setting cjj = 1). 

9. CONCLUDING REMARKS 

In this article I reviewed many new important de- 
velopments in design- and model-based SAE. These 
developments give analysts much richer and more 
versatile tools for their applications. Which approach 
should one follow in practice? Model-based predic- 
tors are generally more accurate and, as discussed in 
Section 4.3, the models permit predictions for non- 
sampled areas for which no design-based theory ex- 
ists. With everything else that can be done under 
a model, much of which reviewed in Sections 6-8, 
it seems to me that the choice between the two ap- 
proaches is clear-cut, unless the sample sizes in all 
the areas are sufficiently large, although even in this 
case models have much more to offer like, for exam- 
ple, in the case of measurement errors or NMAR 
nonresponse. This is not to say that design-based 
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estimators have no role in model-based prediction. 
To begin with, the design-based estimators are often 
the input data for the model, as under the area-level 
model. Design-based estimators can be used for as- 
sessing the model-based predictors or for calibrating 
them via benchmarking, and the sampling weights 
play an important role when accounting for infor- 
mative sampling. 

Next is the question of whether to follow the Bayes- 
ian approach (BA) or the frequentist approach (FA). 
I have to admit that before starting this extensive 
review I was very much in favor of FA, but the BA 
has some clear advantages. This is because one can 
generate as many observations as desired from the 
posterior distributions of the area parameters, and 
hence it is much more flexible in the kind of mod- 
els and inference possibilities that it can handle, for 
example, when the linking model does not match 
the conditional sampling model (Remark 2). Note 
also that the computation of PMSE (Bayes risk) 
or credibility intervals under BA does not rely on 
asymptotic properties. A common criticism of BA 
is that it requires specification of prior distributions 
but as emphasized in Section 8, Bayesian models 
with proper, or improper priors can be tested in a 
variety of ways. Another criticism is that the ap- 
plication of BA is often very computation intensive 
and requires expert knowledge and computing skills 
even with modern available software. While this crit- 
icism may be correct (notably in my experience), 
the use of FA methods when fitting the GLMM is 
also very computation intensive and requires simi- 
lar skills. Saying all this, it is quite obvious to me 
that the use of FA will continue to be dominant for 
many years to come because, except for few excep- 
tions, official statistical bureaus are very reluctant 
to use Bayesian methods. 

Where do we go from here? Research on SAE con- 
tinues all over the world, both in terms of new theo- 
ries and in applications to new intriguing problems, 
and I hope that this review will contribute to this re- 
search. The new developments that I have reviewed 
are generally either under BA or FA, and one pos- 
sible direction that I hope to see is to incorporate 
the new developments under one approach into the 
other. For example, use the EL approach under FA, 
use spline regressions under BA, account for NMAR 
nonresponse in FA or produce poverty mapping with 
BA. Some of these extensions will be simple; other 
may require more extensive research, and some may 
not be feasible, but this will make it easier for ana- 
lysts to choose between the two approaches. 
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